import matplotlib
import sklearn
import pandas
import numpy
matplotlib.__version__
'3.4.3'
pandas.__version__
'1.3.2'
numpy.__version__
'1.19.5'
sklearn.__version__
'1.0.2'
!python --version
Python 3.8.0
import matplotlib.pyplot as plt
import sklearn
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
%matplotlib inline
df = pd.read_csv('D:/Novelis/Novelis_Code_Assessment/data/fraud_final_dataset.csv')
df.head()
| | user_id | signup_time | purchase_time | elapsed_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 285108 | 7/15/2015 4:36 | 9/10/2015 14:17 | 2 | 31 | HZAKVUFTDOSFD | Direct | Chrome | M | 49 | 2.818400e+09 | 0 | United States |
| 1 | 131009 | 1/24/2015 12:29 | 4/13/2015 4:53 | 3 | 31 | XGQAJSOUJIZCC | SEO | IE | F | 21 | 3.251268e+09 | 0 | United Kingdom |
| 2 | 328855 | 3/11/2015 0:54 | 4/5/2015 12:23 | 1 | 16 | VCCTAYDCWKZIY | Direct | IE | M | 26 | 2.727760e+09 | 0 | United States |
| 3 | 229053 | 1/7/2015 13:19 | 1/9/2015 10:12 | 0 | 29 | MFFIHYNXCJLEY | SEO | Chrome | M | 34 | 2.083420e+09 | 0 | Korea Republic of |
| 4 | 108439 | 2/8/2015 21:11 | 4/9/2015 14:26 | 2 | 26 | WMSXWGVPNIFBM | Ads | FireFox | M | 33 | 3.207913e+09 | 0 | Brazil |
Let's try to understand what each column means:

- `user_id`: ID assigned to each new user
- `signup_time`: time of account creation
- `purchase_time`: time of the first purchase
- `elapsed_time`: time taken, in months, to make the first purchase
- `purchase_value`: amount spent on the purchase
- `device_id`: device identifier, unique per device
- `source`: marketing channel, such as Direct, SEO, or Ads
- `browser`: browser used by the user
- `sex`: gender of the user
- `age`: age of the user
- `ip_address`: IP address of the device used
- `class`: whether the transaction is fraudulent, 0 for non-fraudulent and 1 for fraudulent
- `country`: country of the user
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         120000 non-null  int64  
 1   signup_time     120000 non-null  object 
 2   purchase_time   120000 non-null  object 
 3   elapsed_time    120000 non-null  int64  
 4   purchase_value  120000 non-null  int64  
 5   device_id       120000 non-null  object 
 6   source          120000 non-null  object 
 7   browser         120000 non-null  object 
 8   sex             120000 non-null  object 
 9   age             120000 non-null  int64  
 10  ip_address      120000 non-null  float64
 11  class           120000 non-null  int64  
 12  country         120000 non-null  object 
dtypes: float64(1), int64(5), object(7)
memory usage: 11.9+ MB
Luckily, we don't have any missing values, so we can move straight into exploring the data.
Let's start by checking how imbalanced the labels are:
df['class'].value_counts()
0    108735
1     11265
Name: class, dtype: int64
Here, 0 means the transaction was not fraudulent and 1 means it was. We have 108735 non-fraudulent records versus 11265 fraudulent ones (roughly 9.4% fraud), so the dataset is imbalanced, though far less severely than in some credit card fraud detection datasets.
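The imbalance can be quantified directly from `value_counts`; a minimal sketch, using a tiny synthetic frame in place of the real `df`:

```python
import pandas as pd

# Tiny synthetic stand-in for df; the real frame has 120000 rows.
df_demo = pd.DataFrame({'class': [0] * 9 + [1]})

counts = df_demo['class'].value_counts()
fraud_rate = counts.get(1, 0) / len(df_demo)
print(f'Fraud rate: {fraud_rate:.2%}')  # on the real data: 11265/120000, about 9.39%
```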
Let's visualize the impact of different variables
df.head()
| | user_id | signup_time | purchase_time | elapsed_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 285108 | 7/15/2015 4:36 | 9/10/2015 14:17 | 2 | 31 | HZAKVUFTDOSFD | Direct | Chrome | M | 49 | 2.818400e+09 | 0 | United States |
| 1 | 131009 | 1/24/2015 12:29 | 4/13/2015 4:53 | 3 | 31 | XGQAJSOUJIZCC | SEO | IE | F | 21 | 3.251268e+09 | 0 | United Kingdom |
| 2 | 328855 | 3/11/2015 0:54 | 4/5/2015 12:23 | 1 | 16 | VCCTAYDCWKZIY | Direct | IE | M | 26 | 2.727760e+09 | 0 | United States |
| 3 | 229053 | 1/7/2015 13:19 | 1/9/2015 10:12 | 0 | 29 | MFFIHYNXCJLEY | SEO | Chrome | M | 34 | 2.083420e+09 | 0 | Korea Republic of |
| 4 | 108439 | 2/8/2015 21:11 | 4/9/2015 14:26 | 2 | 26 | WMSXWGVPNIFBM | Ads | FireFox | M | 33 | 3.207913e+09 | 0 | Brazil |
fraud_committed = [len(df[(df['sex'] == 'M') & (df['class'] == 1)]),
                   len(df[(df['sex'] == 'F') & (df['class'] == 1)])]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
fig.suptitle('Gender-wise comparison: Number of Users vs Number of Frauds Committed')
# Label the bars directly instead of calling set_xticklabels afterwards,
# which avoids the FixedFormatter/FixedLocator warning.
ax1.bar(['Male', 'Female'], df['sex'].value_counts().loc[['M', 'F']].values, color=['tab:orange', 'tab:blue'])
ax1.set_title('Total Users')
ax2.bar(['Male', 'Female'], fraud_committed, color=['tab:orange', 'tab:blue'])
ax2.set_title('Total Frauds')
plt.show()
Now, let's analyze the relation between the user's age and frauds committed.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
fig.suptitle('Age-wise comparison: Number of Users vs Number of Frauds Committed')
bins = [0, 18, 30, 45, 60, 100]
temp_df = df.groupby(pd.cut(df['age'], bins=bins)).age.count()
temp_df.plot(kind='bar', ax=ax1)
ax1.set_title('Age-wise total users')
temp_df = df[df['class']==1].groupby(pd.cut(df[df['class']==1]['age'], bins)).age.count()
temp_df.plot(kind='bar', ax=ax2)
ax2.set_title('Age-wise total frauds committed')
plt.show()
Most frauds are committed by users aged between 30 and 45, but that is simply because this age bracket also contains the most users: the age distribution of fraudulent transactions mirrors the overall age distribution.
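The claim that the distribution stays constant is easier to verify with fraud *rates* per bin than with raw counts; a sketch on a small synthetic stand-in for `df`:

```python
import pandas as pd

# Small synthetic stand-in for df.
df_demo = pd.DataFrame({
    'age':   [17, 25, 35, 40, 50, 65, 22, 38],
    'class': [ 0,  0,  1,  0,  0,  1,  0,  0],
})

bins = [0, 18, 30, 45, 60, 100]
# The mean of a 0/1 label within each bin is the fraud rate for that bin.
rate_by_age = df_demo.groupby(pd.cut(df_demo['age'], bins=bins))['class'].mean()
print(rate_by_age)
```

If the rates are roughly flat across bins, age adds little signal beyond the user counts.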
df['source'].value_counts()
SEO       48297
Ads       47461
Direct    24242
Name: source, dtype: int64
sources = ['SEO', 'Ads', 'Direct']
fraud_committed = [len(df[(df['source'] == s) & (df['class'] == 1)]) for s in sources]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
fig.suptitle('Source-wise comparison: Number of Users vs Number of Frauds Committed')
# Label the bars directly to avoid the FixedFormatter/FixedLocator warning.
ax1.bar(sources, df['source'].value_counts().loc[sources].values, color=['tab:orange', 'tab:blue', 'tab:green'])
ax1.set_title('Total Users')
ax2.bar(sources, fraud_committed, color=['tab:orange', 'tab:blue', 'tab:green'])
ax2.set_title('Total Frauds')
plt.show()
The distribution is broadly similar across sources, though users who come to the site directly account for a slightly larger share of frauds than of total users. The difference is not significant.
df['browser'].value_counts()
Chrome     48652
IE         29138
Safari     19620
FireFox    19615
Opera       2975
Name: browser, dtype: int64
browsers = ['Chrome', 'IE', 'Safari', 'FireFox', 'Opera']
colors = ['tab:orange', 'tab:blue', 'tab:green', 'tab:red', 'tab:gray']
fraud_committed = [len(df[(df['browser'] == b) & (df['class'] == 1)]) for b in browsers]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
fig.suptitle('Browser-wise comparison: Number of Users vs Number of Frauds Committed')
# Label the bars directly to avoid the FixedFormatter/FixedLocator warning.
ax1.bar(browsers, df['browser'].value_counts().loc[browsers].values, color=colors)
ax1.set_title('Total Users')
ax2.bar(browsers, fraud_committed, color=colors)
ax2.set_title('Total Frauds')
plt.show()
We can see from the above that Chrome is the most used browser overall and, correspondingly, the browser most often used when committing fraud.
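For source and browser alike, a per-category fraud rate separates popularity from risk; a sketch on synthetic data, with column names mirroring the real frame:

```python
import pandas as pd

# Synthetic stand-in for df.
df_demo = pd.DataFrame({
    'browser': ['Chrome', 'Chrome', 'IE', 'Safari', 'Chrome', 'IE'],
    'class':   [1, 0, 0, 0, 1, 1],
})

# Fraud rate per browser: a category can be the most common among frauds
# simply because it is the most common overall.
rate = df_demo.groupby('browser')['class'].mean().sort_values(ascending=False)
print(rate)
```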
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
fig.suptitle('Purchase-value comparison: Number of Users vs Number of Frauds Committed')
bins = [0, 20, 40, 60, 80, 150]
temp_df = df.groupby(pd.cut(df['purchase_value'], bins=bins)).purchase_value.count()
temp_df.plot(kind='bar', ax=ax1)
ax1.set_title('Purchase value for total purchases')
temp_df = df[df['class']==1].groupby(pd.cut(df[df['class']==1]['purchase_value'], bins)).purchase_value.count()
temp_df.plot(kind='bar', ax=ax2)
ax2.set_title('Purchase value when frauds were committed')
plt.show()
Here as well, the distributions are similar: most purchases are small, and fraudulent purchases do not involve unusually large amounts.
Now, to analyze dates and times, we need to convert signup_time and purchase_time into something more meaningful.
df[['signup_date', 'signup_time']] = df['signup_time'].str.split(' ', n=1, expand=True)
df[['purchase_date', 'purchase_time']] = df['purchase_time'].str.split(' ', n=1, expand=True)
df['signup_date'] = pd.to_datetime(df['signup_date'], format='%m/%d/%Y')
df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='%m/%d/%Y')
df['signup_hour'] = pd.to_datetime(df['signup_time'], format='%H:%M').dt.hour
df['purchase_hour'] = pd.to_datetime(df['purchase_time'], format='%H:%M').dt.hour
df['signup_dayoftheweek'] = df['signup_date'].dt.dayofweek
df['purchase_dayoftheweek'] = df['purchase_date'].dt.dayofweek
# signup_date/purchase_date are already datetimes, so day and month can be
# read off directly without another to_datetime round-trip.
df['signup_day_of_the_month'] = df['signup_date'].dt.day
df['purchase_day_of_the_month'] = df['purchase_date'].dt.day
df['signup_month'] = df['signup_date'].dt.month
df['purchase_month'] = df['purchase_date'].dt.month
df['elapsed_time_weeks'] = (df['purchase_date'] - df['signup_date']).dt.days // 7
df.drop(['signup_time', 'purchase_time'], axis=1, inplace=True)
df.head()
| | user_id | elapsed_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | signup_date | purchase_date | signup_hour | purchase_hour | signup_dayoftheweek | purchase_dayoftheweek | signup_day_of_the_month | purchase_day_of_the_month | signup_month | purchase_month | elapsed_time_weeks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 285108 | 2 | 31 | HZAKVUFTDOSFD | Direct | Chrome | M | 49 | 2.818400e+09 | 0 | United States | 2015-07-15 | 2015-09-10 | 4 | 14 | 2 | 3 | 15 | 10 | 7 | 9 | 8 |
| 1 | 131009 | 3 | 31 | XGQAJSOUJIZCC | SEO | IE | F | 21 | 3.251268e+09 | 0 | United Kingdom | 2015-01-24 | 2015-04-13 | 12 | 4 | 5 | 0 | 24 | 13 | 1 | 4 | 11 |
| 2 | 328855 | 1 | 16 | VCCTAYDCWKZIY | Direct | IE | M | 26 | 2.727760e+09 | 0 | United States | 2015-03-11 | 2015-04-05 | 0 | 12 | 2 | 6 | 11 | 5 | 3 | 4 | 3 |
| 3 | 229053 | 0 | 29 | MFFIHYNXCJLEY | SEO | Chrome | M | 34 | 2.083420e+09 | 0 | Korea Republic of | 2015-01-07 | 2015-01-09 | 13 | 10 | 2 | 4 | 7 | 9 | 1 | 1 | 0 |
| 4 | 108439 | 2 | 26 | WMSXWGVPNIFBM | Ads | FireFox | M | 33 | 3.207913e+09 | 0 | Brazil | 2015-02-08 | 2015-04-09 | 21 | 14 | 6 | 3 | 8 | 9 | 2 | 4 | 8 |
arr = df['purchase_hour']
N = 24
bottom = 2
# theta for 24 hours
theta = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
# histogram binned over the 24 hours
radii, tick = np.histogram(arr, bins=24)
# width of each bin on the plot
width = (2 * np.pi) / N
# make a polar plot
plt.figure(figsize=(12, 8))
ax = plt.subplot(111, polar=True)
bars = ax.bar(theta, radii, width=width, bottom=bottom)
# labels go clockwise, starting from the top
ax.set_theta_zero_location("N")
ax.set_theta_direction(-1)
# fix the tick positions before setting the labels, which avoids the
# FixedFormatter/FixedLocator warning
ticks = ['0:00', '3:00', '6:00', '9:00', '12:00', '15:00', '18:00', '21:00']
ax.set_xticks(np.linspace(0, 2 * np.pi, len(ticks), endpoint=False))
ax.set_xticklabels(ticks)
ax.set_title('24-hour distribution of time of purchase')
plt.show()
# code referenced: http://qingkaikong.blogspot.com/2016/04/plot-histogram-on-clock.html
arr = df[df['class']==1]['purchase_hour']
N = 24
bottom = 2
# theta for 24 hours
theta = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
# histogram binned over the 24 hours
radii, tick = np.histogram(arr, bins=24)
# width of each bin on the plot
width = (2 * np.pi) / N
# make a polar plot
plt.figure(figsize=(12, 8))
ax = plt.subplot(111, polar=True)
bars = ax.bar(theta, radii, width=width, bottom=bottom)
# labels go clockwise, starting from the top
ax.set_theta_zero_location("N")
ax.set_theta_direction(-1)
# fix the tick positions before setting the labels, which avoids the
# FixedFormatter/FixedLocator warning
ticks = ['0:00', '3:00', '6:00', '9:00', '12:00', '15:00', '18:00', '21:00']
ax.set_xticks(np.linspace(0, 2 * np.pi, len(ticks), endpoint=False))
ax.set_xticklabels(ticks)
ax.set_title('24-hour distribution of purchases that were fraudulent')
plt.show()
# code referenced: http://qingkaikong.blogspot.com/2016/04/plot-histogram-on-clock.html
Above, we can see that most fraudulent purchases are made around 9 am and 5 pm.
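The peaks read off the polar plot can be cross-checked numerically with `value_counts`; a sketch on a synthetic series standing in for `df[df['class']==1]['purchase_hour']`:

```python
import pandas as pd

# Synthetic stand-in for the fraudulent purchase hours.
fraud_hours = pd.Series([9, 9, 9, 17, 17, 3, 12])

# The busiest fraud hours, confirming the peaks on the polar plot.
top_hours = fraud_hours.value_counts().head(2)
print(top_hours)
```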
arr = df['signup_hour']
N = 24
bottom = 2
# theta for 24 hours
theta = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
# histogram binned over the 24 hours
radii, tick = np.histogram(arr, bins=24)
# width of each bin on the plot
width = (2 * np.pi) / N
# make a polar plot
plt.figure(figsize=(12, 8))
ax = plt.subplot(111, polar=True)
bars = ax.bar(theta, radii, width=width, bottom=bottom)
# labels go clockwise, starting from the top
ax.set_theta_zero_location("N")
ax.set_theta_direction(-1)
# fix the tick positions before setting the labels, which avoids the
# FixedFormatter/FixedLocator warning
ticks = ['0:00', '3:00', '6:00', '9:00', '12:00', '15:00', '18:00', '21:00']
ax.set_xticks(np.linspace(0, 2 * np.pi, len(ticks), endpoint=False))
ax.set_xticklabels(ticks)
ax.set_title('24-hour distribution of time of signup')
plt.show()
# code referenced: http://qingkaikong.blogspot.com/2016/04/plot-histogram-on-clock.html
arr = df[df['class']==1]['signup_hour']
N = 24
bottom = 2
# theta for 24 hours
theta = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
# histogram binned over the 24 hours
radii, tick = np.histogram(arr, bins=24)
# width of each bin on the plot
width = (2 * np.pi) / N
# make a polar plot
plt.figure(figsize=(12, 8))
ax = plt.subplot(111, polar=True)
bars = ax.bar(theta, radii, width=width, bottom=bottom)
# labels go clockwise, starting from the top
ax.set_theta_zero_location("N")
ax.set_theta_direction(-1)
# fix the tick positions before setting the labels, which avoids the
# FixedFormatter/FixedLocator warning
ticks = ['0:00', '3:00', '6:00', '9:00', '12:00', '15:00', '18:00', '21:00']
ax.set_xticks(np.linspace(0, 2 * np.pi, len(ticks), endpoint=False))
ax.set_xticklabels(ticks)
ax.set_title('24-hour distribution of signups that were fraudulent')
plt.show()
# code referenced: http://qingkaikong.blogspot.com/2016/04/plot-histogram-on-clock.html
Here, we can see that fraudulent signups also cluster around 9 am and 5 pm, mirroring the trend seen for purchase times.
df.head()
| | user_id | elapsed_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | signup_date | purchase_date | signup_hour | purchase_hour | signup_dayoftheweek | purchase_dayoftheweek | signup_day_of_the_month | purchase_day_of_the_month | signup_month | purchase_month | elapsed_time_weeks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 285108 | 2 | 31 | HZAKVUFTDOSFD | Direct | Chrome | M | 49 | 2.818400e+09 | 0 | United States | 2015-07-15 | 2015-09-10 | 4 | 14 | 2 | 3 | 15 | 10 | 7 | 9 | 8 |
| 1 | 131009 | 3 | 31 | XGQAJSOUJIZCC | SEO | IE | F | 21 | 3.251268e+09 | 0 | United Kingdom | 2015-01-24 | 2015-04-13 | 12 | 4 | 5 | 0 | 24 | 13 | 1 | 4 | 11 |
| 2 | 328855 | 1 | 16 | VCCTAYDCWKZIY | Direct | IE | M | 26 | 2.727760e+09 | 0 | United States | 2015-03-11 | 2015-04-05 | 0 | 12 | 2 | 6 | 11 | 5 | 3 | 4 | 3 |
| 3 | 229053 | 0 | 29 | MFFIHYNXCJLEY | SEO | Chrome | M | 34 | 2.083420e+09 | 0 | Korea Republic of | 2015-01-07 | 2015-01-09 | 13 | 10 | 2 | 4 | 7 | 9 | 1 | 1 | 0 |
| 4 | 108439 | 2 | 26 | WMSXWGVPNIFBM | Ads | FireFox | M | 33 | 3.207913e+09 | 0 | Brazil | 2015-02-08 | 2015-04-09 | 21 | 14 | 6 | 3 | 8 | 9 | 2 | 4 | 8 |
fig,(ax1,ax2) = plt.subplots(2,1,figsize=(20,5))
fig.tight_layout(pad=3.0)
fig.suptitle('Day of the month - Total purchases made Vs Total purchases made that were fraudulent',y=1.1,fontsize=20)
ax1.hist(df['purchase_day_of_the_month'],bins=np.arange(1,33),rwidth=0.8)
ax1.set_title('Day of the month the purchases were made')
ax2.hist(df[df['class']==1]['purchase_day_of_the_month'],bins=np.arange(1,33),rwidth=0.8)
ax2.set_title('Day of the month the fraudulent purchases were made')
plt.show()
Surprisingly, we see extremely few fraudulent purchases at the end of the month.
fig,(ax1,ax2) = plt.subplots(2,1,figsize=(20,5))
fig.tight_layout(pad=3.0)
fig.suptitle('Day of the month - Total signups made Vs Total signups made that were fraudulent',y=1.1,fontsize=20)
ax1.hist(df['signup_day_of_the_month'],bins=np.arange(1,33),rwidth=0.8)
ax1.set_title('Day of the month the signups were made')
ax2.hist(df[df['class']==1]['signup_day_of_the_month'],bins=np.arange(1,33),rwidth=0.8)
ax2.set_title('Day of the month the fraudulent signups were made')
plt.show()
A similar trend to the purchase dates can be seen for signups as well.
df[df['class']==1]['elapsed_time'].value_counts()
0    6698
2    1331
1    1325
3    1323
4     588
Name: elapsed_time, dtype: int64
fig,(ax1,ax2) = plt.subplots(2,1,figsize=(20,5))
fig.tight_layout(pad=3.0)
fig.suptitle('Elapsed time in month - All transactions Vs Fraudulent transaction',y=1.1,fontsize=20)
ax1.hist(df['elapsed_time'],bins=np.arange(-0.5,5.5),rwidth=0.8)
ax1.set_title('Elapsed time in month for all transactions')
ax2.hist(df[df['class']==1]['elapsed_time'],bins=np.arange(-0.5,5.5),rwidth=0.8)
ax2.set_title('Elapsed time in month for all fraudulent transactions')
plt.show()
Here, it is clear that most fraudulent transactions occur within the first month after signing up (elapsed_time of 0), so this should be an important feature for us.
That's enough EDA. Let's now move on to feature selection, where we will use the knowledge gained during this process.
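The same point can be made with fraud rates per elapsed month instead of raw counts; a sketch on a small synthetic stand-in for `df`:

```python
import pandas as pd

# Synthetic stand-in for df with elapsed months and fraud labels.
df_demo = pd.DataFrame({
    'elapsed_time': [0, 0, 0, 0, 1, 2, 3, 4],
    'class':        [1, 1, 0, 1, 0, 0, 0, 0],
})

# Fraud rate per elapsed month; on the real data, fraud concentrates at 0.
rate_by_elapsed = df_demo.groupby('elapsed_time')['class'].mean()
print(rate_by_elapsed)
```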
First, we drop the user_id column, since it gives us no real insight: the dataset contains each user's first transaction, so there is exactly one record per user. We confirm below that no user_id appears more than once.
max(df['user_id'].value_counts())
1
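pandas also offers `Series.is_unique`, which performs the same check in a single call; a sketch on a small stand-in series:

```python
import pandas as pd

# Stand-in for df['user_id'].
user_id = pd.Series([285108, 131009, 328855])

# True iff every value occurs exactly once, equivalent to
# max(user_id.value_counts()) == 1.
print(user_id.is_unique)
```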
df.drop(['user_id'],axis=1,inplace=True)
df.head()
| | elapsed_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | signup_date | purchase_date | signup_hour | purchase_hour | signup_dayoftheweek | purchase_dayoftheweek | signup_day_of_the_month | purchase_day_of_the_month | signup_month | purchase_month | elapsed_time_weeks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 31 | HZAKVUFTDOSFD | Direct | Chrome | M | 49 | 2.818400e+09 | 0 | United States | 2015-07-15 | 2015-09-10 | 4 | 14 | 2 | 3 | 15 | 10 | 7 | 9 | 8 |
| 1 | 3 | 31 | XGQAJSOUJIZCC | SEO | IE | F | 21 | 3.251268e+09 | 0 | United Kingdom | 2015-01-24 | 2015-04-13 | 12 | 4 | 5 | 0 | 24 | 13 | 1 | 4 | 11 |
| 2 | 1 | 16 | VCCTAYDCWKZIY | Direct | IE | M | 26 | 2.727760e+09 | 0 | United States | 2015-03-11 | 2015-04-05 | 0 | 12 | 2 | 6 | 11 | 5 | 3 | 4 | 3 |
| 3 | 0 | 29 | MFFIHYNXCJLEY | SEO | Chrome | M | 34 | 2.083420e+09 | 0 | Korea Republic of | 2015-01-07 | 2015-01-09 | 13 | 10 | 2 | 4 | 7 | 9 | 1 | 1 | 0 |
| 4 | 2 | 26 | WMSXWGVPNIFBM | Ads | FireFox | M | 33 | 3.207913e+09 | 0 | Brazil | 2015-02-08 | 2015-04-09 | 21 | 14 | 6 | 3 | 8 | 9 | 2 | 4 | 8 |
Now, let's convert the categorical variables into something we can use in our classification model.
First, let's analyze the country column:
print("Total countries: {0}".format(len(df['country'].value_counts())))
Total countries: 177
Let's also analyze the country-wise fraud ratios:
country_list = df['country'].value_counts()
# Count frauds per country once, instead of re-scanning the whole frame
# for every country.
fraud_counts = df.loc[df['class'] == 1, 'country'].value_counts()
frauds_ratio = []
for i, (country, total_users_in_country) in enumerate(country_list.items()):
    frauds = int(fraud_counts.get(country, 0))
    ratio = round(frauds / total_users_in_country, 5)
    frauds_ratio.append(ratio)
    if i < 70:
        print('Country : {:40s} Total Users {:4d} \t Frauds Ratio:{}'.format(country, total_users_in_country, ratio))
Country : United States Total Users 46184 Frauds Ratio:0.09683 Country : Unknown Total Users 17418 Frauds Ratio:0.08514 Country : China Total Users 9532 Frauds Ratio:0.08571 Country : Japan Total Users 5735 Frauds Ratio:0.0966 Country : United Kingdom Total Users 3580 Frauds Ratio:0.10391 Country : Korea Republic of Total Users 3341 Frauds Ratio:0.09159 Country : Germany Total Users 2890 Frauds Ratio:0.07301 Country : France Total Users 2489 Frauds Ratio:0.09361 Country : Brazil Total Users 2353 Frauds Ratio:0.09265 Country : Canada Total Users 2344 Frauds Ratio:0.11519 Country : Italy Total Users 1564 Frauds Ratio:0.08504 Country : Australia Total Users 1491 Frauds Ratio:0.09054 Country : Netherlands Total Users 1325 Frauds Ratio:0.07547 Country : Russian Federation Total Users 1281 Frauds Ratio:0.07806 Country : India Total Users 1014 Frauds Ratio:0.11538 Country : Taiwan; Republic of China (ROC) Total Users 967 Frauds Ratio:0.0848 Country : Mexico Total Users 919 Frauds Ratio:0.12296 Country : Spain Total Users 842 Frauds Ratio:0.07007 Country : Sweden Total Users 842 Frauds Ratio:0.12708 Country : South Africa Total Users 655 Frauds Ratio:0.09466 Country : Switzerland Total Users 639 Frauds Ratio:0.08138 Country : Poland Total Users 586 Frauds Ratio:0.0529 Country : Indonesia Total Users 520 Frauds Ratio:0.08846 Country : Argentina Total Users 513 Frauds Ratio:0.10916 Country : Norway Total Users 480 Frauds Ratio:0.12917 Country : Colombia Total Users 471 Frauds Ratio:0.07006 Country : Turkey Total Users 456 Frauds Ratio:0.07237 Country : Viet Nam Total Users 431 Frauds Ratio:0.06265 Country : Romania Total Users 415 Frauds Ratio:0.05542 Country : Denmark Total Users 381 Frauds Ratio:0.16273 Country : Hong Kong Total Users 368 Frauds Ratio:0.13043 Country : Finland Total Users 363 Frauds Ratio:0.10744 Country : Ukraine Total Users 358 Frauds Ratio:0.1257 Country : Austria Total Users 349 Frauds Ratio:0.0745 Country : Chile Total Users 327 Frauds Ratio:0.14985 
Country : Belgium Total Users 309 Frauds Ratio:0.14563 Country : Iran (ISLAMIC Republic Of) Total Users 304 Frauds Ratio:0.08882 Country : Czech Republic Total Users 291 Frauds Ratio:0.08935 Country : Egypt Total Users 285 Frauds Ratio:0.10877 Country : Thailand Total Users 225 Frauds Ratio:0.06667 Country : New Zealand Total Users 222 Frauds Ratio:0.21171 Country : Saudi Arabia Total Users 217 Frauds Ratio:0.20737 Country : Israel Total Users 216 Frauds Ratio:0.03704 Country : Venezuela Total Users 200 Frauds Ratio:0.15 Country : European Union Total Users 198 Frauds Ratio:0.06566 Country : Ireland Total Users 197 Frauds Ratio:0.23858 Country : Portugal Total Users 186 Frauds Ratio:0.04839 Country : Greece Total Users 175 Frauds Ratio:0.13714 Country : Hungary Total Users 170 Frauds Ratio:0.09412 Country : Malaysia Total Users 169 Frauds Ratio:0.05325 Country : Singapore Total Users 166 Frauds Ratio:0.07229 Country : Pakistan Total Users 152 Frauds Ratio:0.04605 Country : Morocco Total Users 134 Frauds Ratio:0.02239 Country : Philippines Total Users 133 Frauds Ratio:0.06015 Country : Bulgaria Total Users 130 Frauds Ratio:0.00769 Country : Algeria Total Users 98 Frauds Ratio:0.10204 Country : United Arab Emirates Total Users 96 Frauds Ratio:0.13542 Country : Peru Total Users 94 Frauds Ratio:0.28723 Country : Tunisia Total Users 89 Frauds Ratio:0.25843 Country : Ecuador Total Users 87 Frauds Ratio:0.26437 Country : Kenya Total Users 77 Frauds Ratio:0.07792 Country : Seychelles Total Users 77 Frauds Ratio:0.1039 Country : Lithuania Total Users 76 Frauds Ratio:0.21053 Country : Kuwait Total Users 72 Frauds Ratio:0.23611 Country : Slovenia Total Users 69 Frauds Ratio:0.0 Country : Kazakhstan Total Users 67 Frauds Ratio:0.04478 Country : Slovakia (SLOVAK Republic) Total Users 67 Frauds Ratio:0.04478 Country : Costa Rica Total Users 66 Frauds Ratio:0.12121 Country : Croatia (LOCAL Name: Hrvatska) Total Users 64 Frauds Ratio:0.07812 Country : Uruguay Total Users 63 Frauds Ratio:0.06349
Here, we have a choice to make:

1) We can drop the country column, since the ratio (Number of Frauds / Number of Users) is similar in most cases, so we would not lose much information.
2) We can one-hot encode or hash this categorical variable and use it in model training.

Before dropping the column, though, note that some countries, like Saudi Arabia and Sri Lanka, have a noticeably higher fraud ratio than most.
Let's apply one-hot encoding and try to analyze its importance.
df[df['class']==1]['country'].value_counts()
United States 4472
Unknown 1483
China 817
Japan 554
United Kingdom 372
...
Angola 1
Serbia 1
Virgin Islands (U.S.) 1
Iceland 1
Guatemala 1
Name: country, Length: 107, dtype: int64
Here, the Unknown country is the second-highest source of reported fraud cases. We have another decision to make:

1) Do we drop the rows with an Unknown country?
2) Do we keep Unknown and encode or hash it like any other country?

Dropping those rows would discard a sizeable portion of the data, so that option is out. Moreover, since the country is derived from the IP address, an Unknown value suggests an IP that is difficult to trace, which is itself plausible information. So let's keep Unknown as just another value.
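Before committing to that choice, it is worth confirming the Unknown bucket's fraud rate is close to the overall rate (the ratios printed earlier showed roughly 8.5% vs 9.4% on the real data); a sketch on synthetic data:

```python
import pandas as pd

# Synthetic stand-in for df with an 'Unknown' country bucket.
df_demo = pd.DataFrame({
    'country': ['Unknown', 'Unknown', 'Unknown', 'United States', 'Brazil'],
    'class':   [1, 0, 0, 0, 0],
})

# Compare the Unknown bucket's fraud rate against the overall rate.
unknown_rate = df_demo.loc[df_demo['country'] == 'Unknown', 'class'].mean()
overall_rate = df_demo['class'].mean()
print(unknown_rate, overall_rate)
```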
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
X = df.loc[:, df.columns != 'class']
y = df.loc[:,'class']
one_hot = pd.get_dummies(X[['country','source','browser','sex']])
X_train, X_test, y_train, y_test = train_test_split(one_hot, y, test_size=0.33, random_state=1)
sel = SelectFromModel(RandomForestClassifier(n_estimators = 100))
sel.fit(X_train, y_train)
SelectFromModel(estimator=RandomForestClassifier())
selected_feat= one_hot.columns[(sel.get_support())]
len(selected_feat)
61
# Strip the encoding prefixes (e.g. 'country_') for readability.
one_hot.columns = [c.split('_', 1)[1] for c in one_hot.columns]
plt.figure(figsize=(25,10))
plt.bar(X_train.columns[np.array(sel.estimator_.feature_importances_).argsort()[::-1][:20]],np.sort(np.array(sel.estimator_.feature_importances_))[::-1][:20])
plt.ylabel('Feature Importance')
plt.xlabel('Variables')
plt.show()
Above, we can see that countries with a higher fraud ratio have higher feature importance. Countries like Ireland, Ecuador and Luxembourg have higher ratios than most, so they are more relevant.
In my opinion, we should keep the country column, since we are building a generalized model that will see users from all countries.
Let's keep the country column and deal with the cardinality issue it causes in RandomForest. Since we are using RandomForest, we could also use Boruta for feature importance, but as we are restricting ourselves to scikit-learn, we will rely on the results obtained above.
Let's try Hashing now.
from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=10, input_type='string')
# With input_type='string', each sample must be an iterable of string tokens;
# wrap each country in a list so the full name is hashed as one token rather
# than character by character.
f = h.transform([[c] for c in X['country']])
df_hashing = pd.DataFrame(f.toarray(), columns=['1','2','3','4','5','6','7','8','9','10'])
one_hot = pd.get_dummies(X[['source','browser','sex']])
df_hashing = pd.concat([df_hashing,one_hot],axis=1)
X_train, X_test, y_train, y_test = train_test_split(df_hashing, y, test_size=0.33, random_state=1)
sel = SelectFromModel(RandomForestClassifier(n_estimators = 100))
sel.fit(X_train, y_train)
SelectFromModel(estimator=RandomForestClassifier())
# Strip the one-hot prefixes ('source_', 'browser_', 'sex_') for readability;
# the hashed columns ('1'..'10') have no underscore and are left unchanged.
df_hashing.columns = [c.split('_', 1)[1] if '_' in c else c for c in df_hashing.columns]
plt.figure(figsize=(25,10))
plt.bar(X_train.columns[np.array(sel.estimator_.feature_importances_).argsort()[::-1][:20]],np.sort(np.array(sel.estimator_.feature_importances_))[::-1][:20])
plt.ylabel('Feature Importance')
plt.xlabel('Variables')
plt.show()
Here, we can see that the hashed variables carry more feature importance than some of the other variables, so this transformed representation of the categorical column is usable.
Let's drop the country column and replace it with the hashed variables, and replace sex, source and browser with one-hot encoded features.
Now that we know hashing can be useful, we can use it alongside the other features for comparison.
Let's split the data into train and test sets and start preparing our features.
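One caveat worth keeping in mind: with only 10 hash buckets for 177 countries, distinct countries will inevitably collide into the same column. A small self-contained sketch of the hasher's behavior:

```python
from sklearn.feature_extraction import FeatureHasher

# Each country is wrapped in a list so the whole name is hashed as one
# token (with input_type='string', a bare string would be iterated
# character by character).
h = FeatureHasher(n_features=10, input_type='string')
hashed = h.transform([['United States'], ['Brazil'], ['United States']]).toarray()

# Identical inputs always land in identical buckets; each single-token
# sample contributes exactly one +/-1 entry.
print(hashed.shape)  # (3, 10)
```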
df.head()
| | elapsed_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | signup_date | purchase_date | signup_hour | purchase_hour | signup_dayoftheweek | purchase_dayoftheweek | signup_day_of_the_month | purchase_day_of_the_month | signup_month | purchase_month | elapsed_time_weeks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 31 | HZAKVUFTDOSFD | Direct | Chrome | M | 49 | 2.818400e+09 | 0 | United States | 2015-07-15 | 2015-09-10 | 4 | 14 | 2 | 3 | 15 | 10 | 7 | 9 | 8 |
| 1 | 3 | 31 | XGQAJSOUJIZCC | SEO | IE | F | 21 | 3.251268e+09 | 0 | United Kingdom | 2015-01-24 | 2015-04-13 | 12 | 4 | 5 | 0 | 24 | 13 | 1 | 4 | 11 |
| 2 | 1 | 16 | VCCTAYDCWKZIY | Direct | IE | M | 26 | 2.727760e+09 | 0 | United States | 2015-03-11 | 2015-04-05 | 0 | 12 | 2 | 6 | 11 | 5 | 3 | 4 | 3 |
| 3 | 0 | 29 | MFFIHYNXCJLEY | SEO | Chrome | M | 34 | 2.083420e+09 | 0 | Korea Republic of | 2015-01-07 | 2015-01-09 | 13 | 10 | 2 | 4 | 7 | 9 | 1 | 1 | 0 |
| 4 | 2 | 26 | WMSXWGVPNIFBM | Ads | FireFox | M | 33 | 3.207913e+09 | 0 | Brazil | 2015-02-08 | 2015-04-09 | 21 | 14 | 6 | 3 | 8 | 9 | 2 | 4 | 8 |
X = df.loc[:, df.columns != 'class']
y = df.loc[:,'class']
one_hot = pd.get_dummies(X[['source','browser','sex']])
X = pd.concat([X,one_hot],axis=1)
X.drop(['source','browser','sex'],axis=1,inplace=True)
X.drop(['signup_date','purchase_date'],axis=1,inplace=True)
X
| | elapsed_time | purchase_value | device_id | age | ip_address | country | signup_hour | purchase_hour | signup_dayoftheweek | purchase_dayoftheweek | signup_day_of_the_month | purchase_day_of_the_month | signup_month | purchase_month | elapsed_time_weeks | source_Ads | source_Direct | source_SEO | browser_Chrome | browser_FireFox | browser_IE | browser_Opera | browser_Safari | sex_F | sex_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 31 | HZAKVUFTDOSFD | 49 | 2.818400e+09 | United States | 4 | 14 | 2 | 3 | 15 | 10 | 7 | 9 | 8 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 3 | 31 | XGQAJSOUJIZCC | 21 | 3.251268e+09 | United Kingdom | 12 | 4 | 5 | 0 | 24 | 13 | 1 | 4 | 11 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 16 | VCCTAYDCWKZIY | 26 | 2.727760e+09 | United States | 0 | 12 | 2 | 6 | 11 | 5 | 3 | 4 | 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0 | 29 | MFFIHYNXCJLEY | 34 | 2.083420e+09 | Korea Republic of | 13 | 10 | 2 | 4 | 7 | 9 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 2 | 26 | WMSXWGVPNIFBM | 33 | 3.207913e+09 | Brazil | 21 | 14 | 6 | 3 | 8 | 9 | 2 | 4 | 8 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119995 | 2 | 46 | UJYRDGZXTFFJG | 18 | 2.509395e+09 | Netherlands | 11 | 22 | 3 | 3 | 26 | 16 | 2 | 4 | 7 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 119996 | 0 | 26 | EMMTCPTUYQYPX | 36 | 2.946612e+09 | China | 18 | 7 | 5 | 1 | 1 | 25 | 8 | 8 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 119997 | 2 | 41 | YSZGGEARGETEU | 31 | 5.570629e+08 | United States | 12 | 4 | 5 | 3 | 25 | 3 | 7 | 9 | 5 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 119998 | 2 | 50 | BJDWRJULJZNOV | 43 | 2.687887e+09 | Switzerland | 21 | 16 | 3 | 0 | 2 | 22 | 4 | 6 | 11 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 119999 | 2 | 32 | AOKZUNMPCDKVK | 47 | 1.174840e+09 | United States | 15 | 13 | 2 | 3 | 15 | 3 | 7 | 9 | 7 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
120000 rows × 25 columns
from sklearn.model_selection import train_test_split

X_train_temp, X_test_temp, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1, stratify=y)
Let's look at device_id and ip_address next.
ip_address_value_count = X_train_temp['ip_address'].value_counts()
ip_address_value_count
3.874758e+09 18
2.979623e+09 14
1.797069e+09 14
2.050964e+09 13
2.470359e+09 13
..
2.241803e+09 1
1.485240e+09 1
4.017842e+09 1
2.177170e+09 1
2.382439e+09 1
Name: ip_address, Length: 76746, dtype: int64
X_train_temp[X_train_temp['ip_address']==X_train_temp['ip_address'].value_counts().index[0]]
| elapsed_time | purchase_value | device_id | age | ip_address | country | signup_hour | purchase_hour | signup_dayoftheweek | purchase_dayoftheweek | signup_day_of_the_month | purchase_day_of_the_month | signup_month | purchase_month | elapsed_time_weeks | source_Ads | source_Direct | source_SEO | browser_Chrome | browser_FireFox | browser_IE | browser_Opera | browser_Safari | sex_F | sex_M | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 77249 | 3 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 0 | 5 | 1 | 10 | 28 | 1 | 4 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 6064 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 75401 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 97601 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 27019 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 18867 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 92593 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 10680 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 119869 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 95591 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 95593 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 113060 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 37391 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 39401 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 95050 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 63373 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 53968 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 79531 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
Here, we can also see an Unknown country; we will see whether we can do something about that later.
The table above suggests that the likelihood of fraud is highest when an ip_address has been used before, so we need to convert this column into a form the model can learn from.
First, let's create a dictionary that counts how many times each IP address has been used. We will need this dictionary to keep counting on the test data as well.
ip_address_used = {}
X_train_temp.reset_index(drop=True,inplace=True)
X_train = X_train_temp.copy()
def assign_count_ip_address(dataframe, ip_address_used):
    """Update the running count of how many times each IP address has appeared."""
    for i in range(len(dataframe)):
        ip = dataframe.loc[i, 'ip_address']
        ip_address_used[ip] = ip_address_used.get(ip, 0) + 1
    return ip_address_used
ip_address_used = assign_count_ip_address(X_train_temp,ip_address_used)
X_train.loc[:,'Number_of_times_IP_used_before'] = [ip_address_used[i] for i in X_train_temp['ip_address']]
X_test_temp.reset_index(drop=True,inplace=True)
X_test = X_test_temp.copy()
ip_address_used = assign_count_ip_address(X_test,ip_address_used)
X_test.loc[:,'Number_of_times_IP_used_before'] = [ip_address_used[i] for i in X_test['ip_address']]
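The row-by-row loop above works, but pandas can compute the same count vectorized, and `groupby().cumcount()` gives a strictly-prior count if "used before" should exclude the current row. A minimal sketch on a hypothetical toy frame standing in for `X_train_temp`:

```python
import pandas as pd

# Hypothetical toy frame standing in for X_train_temp
toy = pd.DataFrame({'ip_address': [1.0, 2.0, 1.0, 1.0, 3.0]})

# Total number of times each IP appears (what the loop above computes):
total_uses = toy['ip_address'].map(toy['ip_address'].value_counts())

# Strictly *prior* occurrences, if "used before" should exclude the current row:
prior_uses = toy.groupby('ip_address').cumcount()

print(total_uses.tolist())  # [3, 1, 3, 3, 1]
print(prior_uses.tolist())  # [0, 0, 1, 2, 0]
```

Either definition is defensible; the notebook's loop uses the total count, which includes the current transaction.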
Now let's do the same for device_id.
device_id_value_count = X_train_temp['device_id'].value_counts()
device_id_value_count
ITUMJCKWEYNDD 18
EQYVNEGOFLAWK 14
FVYSKVOAMYIZM 14
UFBULQADXSSOG 13
NGQCKIADMZORL 13
..
BOWAMGNCBBLSJ 1
EJMIHXCLRZSQI 1
JQBPZZWPYICGK 1
NGQCVRMWNLPYI 1
HFKIWCYJGWOVZ 1
Name: device_id, Length: 75172, dtype: int64
X_train_temp[X_train_temp['device_id']==X_train_temp['device_id'].value_counts().index[0]]
| elapsed_time | purchase_value | device_id | age | ip_address | country | signup_hour | purchase_hour | signup_dayoftheweek | purchase_dayoftheweek | signup_day_of_the_month | purchase_day_of_the_month | signup_month | purchase_month | elapsed_time_weeks | source_Ads | source_Direct | source_SEO | browser_Chrome | browser_FireFox | browser_IE | browser_Opera | browser_Safari | sex_F | sex_M | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6121 | 3 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 0 | 5 | 1 | 10 | 28 | 1 | 4 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 11091 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 15558 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 24665 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 25212 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 29375 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 32203 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 32362 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 34410 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 37124 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 41883 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 44392 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 44554 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 53671 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 57942 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 58351 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 64592 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 75262 | 0 | 38 | ITUMJCKWEYNDD | 43 | 3.874758e+09 | Unknown | 23 | 23 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
We count device_id too, as devices used multiple times have a higher chance of being fraudulent.
device_id_used = {}
def assign_count_device_id(dataframe, device_id_used):
    """Update the running count of how many times each device ID has appeared."""
    for i in range(len(dataframe)):
        device = dataframe.loc[i, 'device_id']
        device_id_used[device] = device_id_used.get(device, 0) + 1
    return device_id_used
device_id_used = assign_count_device_id(X_train_temp,device_id_used)
X_train.loc[:,'Number_of_times_device_ID_used_before'] = [device_id_used[i] for i in X_train_temp['device_id']]
X_train.drop(['device_id','ip_address'],axis=1,inplace=True)
device_id_used = assign_count_device_id(X_test_temp,device_id_used)
X_test.loc[:,'Number_of_times_device_ID_used_before'] = [device_id_used[i] for i in X_test_temp['device_id']]
X_test.drop(['device_id','ip_address'],axis=1,inplace=True)
X_train.head()
| elapsed_time | purchase_value | age | country | signup_hour | purchase_hour | signup_dayoftheweek | purchase_dayoftheweek | signup_day_of_the_month | purchase_day_of_the_month | signup_month | purchase_month | elapsed_time_weeks | source_Ads | source_Direct | source_SEO | browser_Chrome | browser_FireFox | browser_IE | browser_Opera | browser_Safari | sex_F | sex_M | Number_of_times_IP_used_before | Number_of_times_device_ID_used_before | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 26 | 35 | United States | 11 | 11 | 2 | 5 | 1 | 29 | 7 | 8 | 8 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1 | 2 | 30 | 35 | Unknown | 16 | 23 | 2 | 5 | 8 | 19 | 7 | 9 | 10 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 2 | 2 | 21 | 23 | United States | 5 | 0 | 6 | 5 | 1 | 9 | 3 | 5 | 9 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 3 | 3 | 94 | 30 | Sweden | 9 | 17 | 3 | 1 | 28 | 25 | 5 | 8 | 12 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
| 4 | 0 | 59 | 33 | Korea Republic of | 21 | 21 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 9 |
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=10, input_type='string')
f = h.transform(X_train['country'])
df_hashing = pd.DataFrame(f.toarray(),columns=['country_1','country_2','country_3','country_4','country_5','country_6','country_7',
'country_8','country_9','country_10'])
X_train = pd.concat([X_train,df_hashing],axis=1)
X_train.drop(['country'],axis=1,inplace=True)
X_train.head()
| elapsed_time | purchase_value | age | signup_hour | purchase_hour | signup_dayoftheweek | purchase_dayoftheweek | signup_day_of_the_month | purchase_day_of_the_month | signup_month | purchase_month | elapsed_time_weeks | source_Ads | source_Direct | source_SEO | browser_Chrome | browser_FireFox | browser_IE | browser_Opera | browser_Safari | sex_F | sex_M | Number_of_times_IP_used_before | Number_of_times_device_ID_used_before | country_1 | country_2 | country_3 | country_4 | country_5 | country_6 | country_7 | country_8 | country_9 | country_10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 26 | 35 | 11 | 11 | 2 | 5 | 1 | 29 | 7 | 8 | 8 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.0 | -1.0 | 1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 |
| 1 | 2 | 30 | 35 | 16 | 23 | 2 | 5 | 8 | 19 | 7 | 9 | 10 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | -1.0 | -2.0 | 0.0 | 1.0 | -3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2 | 21 | 23 | 5 | 0 | 6 | 5 | 1 | 9 | 3 | 5 | 9 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0.0 | -1.0 | 1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 |
| 3 | 3 | 94 | 30 | 9 | 17 | 3 | 1 | 28 | 25 | 5 | 8 | 12 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0.0 | -1.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | -1.0 | 3.0 |
| 4 | 0 | 59 | 33 | 21 | 21 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 9 | 0.0 | -2.0 | 4.0 | 3.0 | 0.0 | 1.0 | -1.0 | -1.0 | 0.0 | 3.0 |
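A subtlety worth noting about the hashed columns above: because a Python string is itself an iterable of one-character strings, passing a Series of country names to `FeatureHasher(input_type='string')` hashes each *character* as a token, which is why the values are signed integer counts rather than single ±1 entries. A small sketch of the difference (the alternative of wrapping each name in a list is shown only for illustration):

```python
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=10, input_type='string')

# Passing raw strings makes each *character* a token, because a Python
# string is itself an iterable of 1-character strings:
per_char = h.transform(['United States']).toarray()

# Wrapping each string in a list hashes the whole country name as one token:
whole = h.transform([['United States']]).toarray()

print(per_char[0])  # several signed entries, one per hashed character
print(whole[0])     # a single +/-1 entry
```

Both variants produce a fixed-width, collision-tolerant encoding; the per-character version used in this notebook acts like a hashed character-count signature of the country name.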
Now, this is the dataframe we can use to compute feature importances.
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
sel.fit(X_train, y_train)
SelectFromModel(estimator=RandomForestClassifier())
importances = sel.estimator_.feature_importances_
order = np.argsort(importances)[::-1]
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(20, 8))
fig.tight_layout(pad=3.0)
fig.suptitle('Feature Importance in descending order', y=1.1, fontsize=20)
for ax, sl in zip((ax1, ax2, ax3), (slice(0, 7), slice(7, 18), slice(18, None))):
    ax.bar(X_train.columns[order[sl]], importances[order[sl]])
    ax.set_ylabel('Feature Importance')
    ax.set_xlabel('Variables')
    ax.grid()
plt.show()
As we can see, the engineered variables such as Number_of_times_IP_used_before, Number_of_times_device_ID_used_before, purchase_month and elapsed_time_weeks rank among the most important features.
X_train.columns[np.array(sel.estimator_.feature_importances_).argsort()[::-1][:7]]
Index(['Number_of_times_device_ID_used_before',
'Number_of_times_IP_used_before', 'elapsed_time_weeks',
'purchase_month', 'purchase_day_of_the_month', 'elapsed_time',
'signup_day_of_the_month'],
dtype='object')
Let's prepare the test dataset the same way, using the same hasher.
f = h.transform(X_test['country'])
df_hashing = pd.DataFrame(f.toarray(),columns=['country_1','country_2','country_3','country_4','country_5','country_6','country_7',
'country_8','country_9','country_10'])
X_test = pd.concat([X_test,df_hashing],axis=1)
X_test.drop(['country'],axis=1,inplace=True)
Since our data is imbalanced, we could apply various over- and under-sampling methods to tackle this. But since we are restricted to scikit-learn for now, we will stick with what it can achieve.
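Within scikit-learn itself, the main lever for imbalance is `class_weight`, which reweights samples inversely to class frequency so the minority (fraud) class contributes more to the split criterion. A minimal sketch on synthetic data (the dataset here is a hypothetical stand-in, not the fraud frame):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Toy imbalanced data, roughly 9:1, standing in for the fraud dataset
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# class_weight='balanced' scales each class's weight by n_samples / (n_classes * class_count),
# so misclassifying the rare class is penalised more heavily during tree growth.
clf = RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=0)
clf.fit(X_demo, y_demo)
print(clf.score(X_demo, y_demo))
```

This is the same idea the grid search below explores through the `rf__class_weight` parameter.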
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report,confusion_matrix,precision_recall_curve,average_precision_score, accuracy_score
import pickle
rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train,y_train)
RandomForestClassifier(n_estimators=500)
y_hat = rf.predict(X_test)
y_hat
array([0, 0, 0, ..., 1, 0, 0], dtype=int64)
sum(y_test ==y_hat)/len(y_test)
0.9563888888888888
print(classification_report(y_test,y_hat,target_names=['Class 0','Class 1']))
precision recall f1-score support
Class 0 0.95 1.00 0.98 35883
Class 1 0.99 0.54 0.70 3717
accuracy 0.96 39600
macro avg 0.97 0.77 0.84 39600
weighted avg 0.96 0.96 0.95 39600
Here, the results are quite decent, but the recall is too low for our use case: a false positive costs us $8, while a false negative can cost much more, so we need higher recall.
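Given that a false positive costs $8 while a false negative costs more, the decision threshold can be chosen to minimise expected dollar cost rather than accuracy. A sketch of that idea, where the $50 false-negative cost is a hypothetical figure (only the $8 figure comes from the problem statement):

```python
import numpy as np

def best_threshold(y_true, p_fraud, fp_cost=8.0, fn_cost=50.0):
    """Pick the probability cutoff minimising total dollar cost.
    fn_cost=50 is a hypothetical figure; only the $8 FP cost is given."""
    thresholds = np.linspace(0.05, 0.95, 19)
    costs = []
    for t in thresholds:
        pred = (p_fraud >= t).astype(int)
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        costs.append(fp * fp_cost + fn * fn_cost)
    return thresholds[int(np.argmin(costs))]

# Toy example with well-separated scores
y_true = np.array([0, 0, 1, 1])
p_fraud = np.array([0.12, 0.22, 0.78, 0.88])
print(best_threshold(y_true, p_fraud))
</```

Because false negatives are assumed costlier, the optimal cutoff typically sits below 0.5, which raises recall at the expense of some precision.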
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train,y_train)
GradientBoostingClassifier()
y_hat = gb.predict(X_test)
print(classification_report(y_test,y_hat,target_names=['Class 0','Class 1']))
precision recall f1-score support
Class 0 0.95 1.00 0.98 35883
Class 1 0.99 0.54 0.70 3717
accuracy 0.96 39600
macro avg 0.97 0.77 0.84 39600
weighted avg 0.96 0.96 0.95 39600
# use predicted probabilities (not hard labels) so the curve covers more than one threshold
precision, recall, threshold = precision_recall_curve(y_test, gb.predict_proba(X_test)[:, 1])
plt.plot(recall,precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Tradeoff')
plt.show()
The same can be said about the result above: the recall is still not good enough for our use case.
pipeline = Pipeline([
('scaler',StandardScaler()),
('rf',RandomForestClassifier())
])
params = {
'rf__n_estimators':[200,230],
'rf__max_depth':[30,50],
'rf__min_samples_split':[2,3],
'rf__min_samples_leaf':[3,5],
'rf__class_weight':[{0:1,1:1},{0:1,1:5},{0:1,1:3},'balanced']
}
# randomForest = GridSearchCV(pipeline,param_grid=params,scoring='roc_auc',cv=3)
# randomForest.fit(X_train,y_train)
filename = 'finalized_model.sav'
# pickle.dump(randomForest, open(filename, 'wb'))
randomForest = pickle.load(open(filename, 'rb'))  # load the previously fitted model back from disk
y_hat = randomForest.predict(X_test)
c:\users\madha\appdata\local\programs\python\python38\lib\site-packages\sklearn\base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names warnings.warn(
y_prob = randomForest.predict_proba(X_test)
c:\users\madha\appdata\local\programs\python\python38\lib\site-packages\sklearn\base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names warnings.warn(
print(classification_report(y_test,y_hat))
precision recall f1-score support
0 0.95 1.00 0.98 35883
1 0.99 0.54 0.70 3717
accuracy 0.96 39600
macro avg 0.97 0.77 0.84 39600
weighted avg 0.96 0.96 0.95 39600
print(confusion_matrix(y_test,y_hat))
[[35862    21]
 [ 1706  2011]]
auprc = average_precision_score(y_test, y_hat)
auprc
0.5785171889145597
# y_prob holds predicted probabilities, which give a proper precision-recall curve
precision, recall, threshold = precision_recall_curve(y_test, y_prob[:, 1])
plt.plot(recall,precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Tradeoff')
plt.show()
y_result = []
for i in range(len(y_prob)):
    # if the purchase value is more than $8, lower the decision threshold to 0.1
    if int(X_test.loc[i, 'purchase_value']) > 8:
        y_result.append(1 if y_prob[i][1] > 0.1 else 0)
    else:
        y_result.append(1 if y_prob[i][1] > 0.5 else 0)
average_precision_score(y_test, y_result)
0.3906567264911223
print(classification_report(y_test,y_result))
precision recall f1-score support
0 0.97 0.94 0.95 35883
1 0.52 0.69 0.60 3717
accuracy 0.91 39600
macro avg 0.75 0.81 0.77 39600
weighted avg 0.93 0.91 0.92 39600
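The per-row threshold rule above can also be written vectorized with `np.where`, which avoids the Python loop over ~40k rows. A small self-contained sketch of the same logic on toy arrays:

```python
import numpy as np

# Vectorized version of the loop above: lower the cutoff to 0.1 whenever
# the purchase value exceeds $8, otherwise use the default 0.5.
purchase_value = np.array([5, 12, 30, 7])
p_fraud = np.array([0.3, 0.15, 0.05, 0.6])

threshold = np.where(purchase_value > 8, 0.1, 0.5)
y_result = (p_fraud > threshold).astype(int)
print(y_result.tolist())  # [0, 1, 0, 1]
```

On the real frame this would be `np.where(X_test['purchase_value'] > 8, 0.1, 0.5)` compared against `y_prob[:, 1]`.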
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_smote, y_smote = sm.fit_resample(X_train, y_train)
X_smote
| elapsed_time | purchase_value | age | signup_hour | purchase_hour | signup_dayoftheweek | purchase_dayoftheweek | signup_day_of_the_month | purchase_day_of_the_month | signup_month | purchase_month | elapsed_time_weeks | source_Ads | source_Direct | source_SEO | browser_Chrome | browser_FireFox | browser_IE | browser_Opera | browser_Safari | sex_F | sex_M | Number_of_times_IP_used_before | Number_of_times_device_ID_used_before | country_1 | country_2 | country_3 | country_4 | country_5 | country_6 | country_7 | country_8 | country_9 | country_10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 26 | 35 | 11 | 11 | 2 | 5 | 1 | 29 | 7 | 8 | 8 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.000000 | -1.000000 | 1.000000 | 0.0 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | -1.000000 |
| 1 | 2 | 30 | 35 | 16 | 23 | 2 | 5 | 8 | 19 | 7 | 9 | 10 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | -1.000000 | -2.000000 | 0.000000 | 1.0 | -3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 2 | 21 | 23 | 5 | 0 | 6 | 5 | 1 | 9 | 3 | 5 | 9 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0.000000 | -1.000000 | 1.000000 | 0.0 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | -1.000000 |
| 3 | 3 | 94 | 30 | 9 | 17 | 3 | 1 | 28 | 25 | 5 | 8 | 12 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0.000000 | -1.000000 | 0.000000 | 0.0 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 3.000000 |
| 4 | 0 | 59 | 33 | 21 | 21 | 5 | 5 | 10 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 9 | 0.000000 | -2.000000 | 4.000000 | 3.0 | 0.000000 | 1.000000 | -1.000000 | -1.000000 | 0.000000 | 3.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145699 | 3 | 10 | 21 | 22 | 2 | 5 | 6 | 21 | 19 | 5 | 8 | 12 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0.880625 | -1.761249 | -0.119375 | 0.0 | -0.761249 | -0.880625 | -0.119375 | 0.119375 | 0.000000 | -0.119375 |
| 145700 | 0 | 17 | 34 | 17 | 21 | 4 | 4 | 12 | 18 | 5 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1.369177 | -2.369177 | 0.000000 | 0.0 | -1.000000 | -0.630823 | 0.000000 | 0.000000 | 0.000000 | -0.738353 |
| 145701 | 0 | 42 | 29 | 23 | 23 | 0 | 0 | 12 | 12 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 4 | 1.000000 | 0.000000 | 0.000000 | 0.0 | -1.000000 | 1.000000 | 0.000000 | -1.000000 | -1.000000 | 1.000000 |
| 145702 | 1 | 36 | 27 | 11 | 4 | 0 | 1 | 4 | 17 | 1 | 2 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0.356785 | -0.643215 | 0.643215 | 0.0 | -1.000000 | 0.356785 | 0.000000 | -0.356785 | -1.000000 | -0.286430 |
| 145703 | 0 | 51 | 40 | 2 | 2 | 3 | 3 | 8 | 8 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 5 | 1.799013 | 0.598025 | 0.000000 | 0.0 | -1.000000 | 0.000000 | -1.000000 | 0.000000 | -0.200987 | 0.000000 |
145704 rows × 34 columns
Now, let's retrain the Random Forest on the SMOTE-resampled data.
rf = RandomForestClassifier(n_estimators=200)
rf.fit(X_smote,y_smote)
RandomForestClassifier(n_estimators=200)
y_hat = rf.predict(X_test)
y_hat
array([0, 0, 0, ..., 1, 0, 0], dtype=int64)
sum(y_test ==y_hat)/len(y_test)
0.9562878787878788
print(classification_report(y_test,y_hat,target_names=['Class 0','Class 1']))
precision recall f1-score support
Class 0 0.96 0.99 0.97 35883
Class 1 0.88 0.55 0.68 3717
accuracy 0.95 39600
macro avg 0.92 0.77 0.83 39600
weighted avg 0.95 0.95 0.95 39600
Since we are working with an imbalanced dataset, let's also try framing fraud as anomaly detection, using an Isolation Forest.
from sklearn.ensemble import IsolationForest
f = h.transform(X['country'])
df_hashing = pd.DataFrame(f.toarray(),columns=['country_1','country_2','country_3','country_4','country_5','country_6','country_7',
'country_8','country_9','country_10'])
X_temp = pd.concat([X,df_hashing],axis=1)
X_temp.drop(['country'],axis=1,inplace=True)
X_temp.drop(['device_id','ip_address'],axis=1,inplace=True)
# contamination is meant to be the expected fraction of outliers in the data;
# fraud/non-fraud slightly overestimates fraud/total, but is close enough here
outlier_fraction = y.value_counts()[1]/float(y.value_counts()[0])
outlier_fraction
0.10360049662022348
IF = IsolationForest(n_estimators=200, max_samples=len(X_temp),
contamination=outlier_fraction,random_state=42, verbose=0)
IF.fit(X_temp)
c:\users\madha\appdata\local\programs\python\python38\lib\site-packages\sklearn\base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names warnings.warn(
IsolationForest(contamination=0.10360049662022348, max_samples=120000,
n_estimators=200, random_state=42)
scores_prediction = IF.decision_function(X_temp)
y_pred = IF.predict(X_temp)
y_pred[y_pred == 1] = 0
y_pred[y_pred == -1] = 1
print(accuracy_score(y,y_pred))
0.819625
print(classification_report(y,y_pred))
precision recall f1-score support
0 0.90 0.90 0.90 108735
1 0.08 0.09 0.09 11265
accuracy 0.82 120000
macro avg 0.49 0.49 0.49 120000
weighted avg 0.83 0.82 0.82 120000
Here, the results are not impressive compared to the Random Forest.
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from tensorflow.keras.losses import MeanSquaredLogarithmicError
f = h.transform(X['country'])
df_hashing = pd.DataFrame(f.toarray(),columns=['country_1','country_2','country_3','country_4','country_5','country_6','country_7',
'country_8','country_9','country_10'])
X_temp = pd.concat([X,df_hashing],axis=1)
X_temp.drop(['country'],axis=1,inplace=True)
X_temp.drop(['device_id','ip_address'],axis=1,inplace=True)
x_train, x_test, y_train, y_test = train_test_split(X_temp, y, test_size=0.33, stratify=y)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
x_train, y_train = sm.fit_resample(x_train, y_train)
train_index = y_train[y_train == 1].index
train_data = x_train.loc[train_index]
min_max_scaler = MinMaxScaler(feature_range=(0, 1))
x_train_scaled = min_max_scaler.fit_transform(train_data.copy())
x_test_scaled = min_max_scaler.transform(x_test.copy())
class AutoEncoder(Model):
"""
Parameters
----------
output_units: int
Number of output units
code_size: int
Number of units in the bottleneck layer
"""
def __init__(self, output_units, code_size=8):
super().__init__()
self.encoder = Sequential([
Dense(64, activation='relu'),
Dropout(0.1),
Dense(32, activation='relu'),
Dropout(0.1),
Dense(16, activation='relu'),
Dropout(0.1),
Dense(code_size, activation='relu')
])
self.decoder = Sequential([
Dense(16, activation='relu'),
Dropout(0.1),
Dense(32, activation='relu'),
Dropout(0.1),
Dense(64, activation='relu'),
Dropout(0.1),
Dense(output_units, activation='sigmoid')
])
def call(self, inputs):
encoded = self.encoder(inputs)
decoded = self.decoder(encoded)
return decoded
model = AutoEncoder(output_units=x_train_scaled.shape[1])
# configurations of model
model.compile(loss='msle', metrics=['mse'], optimizer='adam')
history = model.fit(
x_train_scaled,
x_train_scaled,
epochs=100,
batch_size=512,
validation_data=(x_test_scaled, x_test_scaled)
)
Epoch 1/100 143/143 [==============================] - 3s 7ms/step - loss: 0.0476 - mse: 0.0928 - val_loss: 0.0471 - val_mse: 0.1016
Epoch 2/100 143/143 [==============================] - 1s 7ms/step - loss: 0.0320 - mse: 0.0662 - val_loss: 0.0399 - val_mse: 0.0878
...
Epoch 74/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0125 - mse: 0.0257 - val_loss: 0.0192 - val_mse: 0.0414
Epoch 75/100 143/143 [==============================] - 1s 6ms/step - (log truncated)
loss: 0.0126 - mse: 0.0258 - val_loss: 0.0191 - val_mse: 0.0410 Epoch 76/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0126 - mse: 0.0258 - val_loss: 0.0190 - val_mse: 0.0409 Epoch 77/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0125 - mse: 0.0256 - val_loss: 0.0191 - val_mse: 0.0411 Epoch 78/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0125 - mse: 0.0256 - val_loss: 0.0189 - val_mse: 0.0408 Epoch 79/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0125 - mse: 0.0256 - val_loss: 0.0189 - val_mse: 0.0407 Epoch 80/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0125 - mse: 0.0256 - val_loss: 0.0189 - val_mse: 0.0408 Epoch 81/100 143/143 [==============================] - 1s 7ms/step - loss: 0.0124 - mse: 0.0255 - val_loss: 0.0186 - val_mse: 0.0401 Epoch 82/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0124 - mse: 0.0255 - val_loss: 0.0188 - val_mse: 0.0405 Epoch 83/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0124 - mse: 0.0254 - val_loss: 0.0185 - val_mse: 0.0399 Epoch 84/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0124 - mse: 0.0254 - val_loss: 0.0186 - val_mse: 0.0401 Epoch 85/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0124 - mse: 0.0255 - val_loss: 0.0185 - val_mse: 0.0398 Epoch 86/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0124 - mse: 0.0254 - val_loss: 0.0185 - val_mse: 0.0398 Epoch 87/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0124 - mse: 0.0254 - val_loss: 0.0185 - val_mse: 0.0398 Epoch 88/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0123 - mse: 0.0253 - val_loss: 0.0183 - val_mse: 0.0394 Epoch 89/100 143/143 [==============================] - 1s 7ms/step - loss: 0.0123 - mse: 0.0252 - val_loss: 0.0184 - val_mse: 0.0395 Epoch 90/100 143/143 [==============================] - 1s 
7ms/step - loss: 0.0123 - mse: 0.0252 - val_loss: 0.0184 - val_mse: 0.0397 Epoch 91/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0123 - mse: 0.0253 - val_loss: 0.0183 - val_mse: 0.0395 Epoch 92/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0123 - mse: 0.0252 - val_loss: 0.0181 - val_mse: 0.0390 Epoch 93/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0123 - mse: 0.0251 - val_loss: 0.0180 - val_mse: 0.0388 Epoch 94/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0123 - mse: 0.0251 - val_loss: 0.0181 - val_mse: 0.0391 Epoch 95/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0122 - mse: 0.0251 - val_loss: 0.0181 - val_mse: 0.0389 Epoch 96/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0122 - mse: 0.0251 - val_loss: 0.0180 - val_mse: 0.0388 Epoch 97/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0122 - mse: 0.0250 - val_loss: 0.0178 - val_mse: 0.0383 Epoch 98/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0122 - mse: 0.0250 - val_loss: 0.0178 - val_mse: 0.0384 Epoch 99/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0122 - mse: 0.0250 - val_loss: 0.0180 - val_mse: 0.0387 Epoch 100/100 143/143 [==============================] - 1s 6ms/step - loss: 0.0122 - mse: 0.0250 - val_loss: 0.0177 - val_mse: 0.0382
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.xlabel('Epochs')
plt.ylabel('MSLE Loss')
plt.legend(['loss', 'val_loss'])
plt.show()
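The validation loss was still inching downward at epoch 100, so training for a fixed 100 epochs is somewhat arbitrary. Early stopping halts training once `val_loss` stops improving; Keras provides this as the `EarlyStopping` callback, and the patience logic it implements can be sketched in plain Python on a made-up loss sequence:

```python
# Toy val-loss sequence: improves quickly, then plateaus.
val_losses = [0.030, 0.025, 0.022, 0.0218, 0.0217, 0.0216, 0.0216]

patience = 3       # epochs to wait for an improvement before stopping
min_delta = 0.001  # minimum decrease that counts as an improvement

best, wait, stopped_at = float('inf'), 0, len(val_losses)
for epoch, loss in enumerate(val_losses, start=1):
    if best - loss > min_delta:
        best, wait = loss, 0   # improvement: reset the patience counter
    else:
        wait += 1
        if wait >= patience:   # no improvement for `patience` epochs
            stopped_at = epoch
            break

print(stopped_at)  # → 6
```

In Keras this would be `model.fit(..., callbacks=[EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)])`.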
def find_threshold(model, x_train_scaled):
    reconstructions = model.predict(x_train_scaled)
    # per-instance reconstruction errors (MSLE between input and reconstruction)
    reconstruction_errors = tf.keras.losses.msle(reconstructions, x_train_scaled)
    # threshold for anomaly scores: mean plus one standard deviation
    threshold = np.mean(reconstruction_errors.numpy()) \
                + np.std(reconstruction_errors.numpy())
    return threshold
def get_predictions(model, x_test_scaled, threshold):
    predictions = model.predict(x_test_scaled)
    # per-instance reconstruction errors
    errors = tf.keras.losses.msle(predictions, x_test_scaled)
    # 0 = anomaly (error above threshold), 1 = normal
    anomaly_mask = pd.Series(errors) > threshold
    preds = anomaly_mask.map(lambda x: 0.0 if x else 1.0)
    return preds
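The thresholding logic above (mean plus one standard deviation of the reconstruction errors, with anomalies mapped to 0) can be illustrated with a TensorFlow-free toy example on made-up error values:

```python
import numpy as np

# Hypothetical reconstruction errors: mostly small (normal), two large (anomalous).
errors = np.array([0.01, 0.012, 0.009, 0.011, 0.5, 0.45])

# Same rule as find_threshold: mean + 1 std of the errors.
threshold = errors.mean() + errors.std()

# Same convention as get_predictions: 0 = anomaly, 1 = normal.
preds = np.where(errors > threshold, 0.0, 1.0)
print(preds)  # → [1. 1. 1. 1. 0. 0.]
```

Only the two clearly large errors exceed the threshold and get flagged as anomalies.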
threshold = find_threshold(model, x_train_scaled)
print(f"Threshold: {threshold}")
predictions = get_predictions(model, x_test_scaled, threshold)
accuracy_score(y_test, predictions)
Threshold: 0.01700200599599767
0.5177272727272727
print(confusion_matrix(y_test,predictions))
[[17078 18805]
 [ 1761  1956]]
print(classification_report(y_test,predictions))
precision recall f1-score support
0 0.91 0.48 0.62 35883
1 0.09 0.53 0.16 3717
accuracy 0.48 39600
macro avg 0.50 0.50 0.39 39600
weighted avg 0.83 0.48 0.58 39600
Here again, the result is not great: the autoencoder barely beats random guessing. So we can conclude that Random Forest remains our best option for this dataset.
import plotly.offline as pyo
import plotly.graph_objs as go
# Set notebook mode to work in offline
pyo.init_notebook_mode()
X_train, x_test, y_train, y_test = train_test_split(X_temp, y, test_size=0.33, stratify=y)
sel = SelectFromModel(RandomForestClassifier(n_estimators = 500))
sel.fit(X_train, y_train)
SelectFromModel(estimator=RandomForestClassifier(n_estimators=500))
top_features = X_train.columns[np.array(sel.estimator_.feature_importances_).argsort()[::-1][:15]]
X_train_temp = X_train.loc[:, top_features]
x_test_temp = x_test.loc[:, top_features]
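As an aside, the manual argsort-based top-15 selection can be expressed directly with `SelectFromModel`'s `max_features` argument (setting `threshold=-np.inf` disables the importance cutoff so exactly `max_features` columns are kept). A sketch on synthetic data, not the notebook's dataframes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Keep exactly the 5 most important features, ranked by RF importance.
sel = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0),
                      max_features=5, threshold=-np.inf)
X_top = sel.fit_transform(X, y)
print(X_top.shape)  # → (200, 5)
```

`sel.get_support()` then gives a boolean mask for slicing the original dataframe's columns.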
rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train_temp,y_train)
RandomForestClassifier(n_estimators=500)
y_hat = rf.predict(x_test_temp)
y_hat
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
print(classification_report(y_test,y_hat,target_names=['Class 0','Class 1']))
precision recall f1-score support
Class 0 0.95 1.00 0.98 35883
Class 1 0.99 0.53 0.69 3717
accuracy 0.96 39600
macro avg 0.97 0.76 0.83 39600
weighted avg 0.96 0.96 0.95 39600
After dropping some features, we see a slight drop in recall.
auprc = average_precision_score(y_test,y_hat)
auprc
0.5686039725726
y_prob = rf.predict_proba(x_test_temp)
# use the predicted probabilities (not hard labels) to trace the full curve
precision, recall, threshold = precision_recall_curve(y_test, y_prob[:, 1])
plt.plot(recall,precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Tradeoff')
plt.show()
In terms of the precision-recall trade-off, there is not much more we can do to improve precision without hurting recall.
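To make the trade-off concrete: `precision_recall_curve` returns one precision/recall pair per candidate threshold, so a target precision can be translated into a decision threshold. A minimal sketch on synthetic imbalanced data (not the fraud dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Full curve from probabilities; each threshold trades recall for precision.
precision, recall, thresholds = precision_recall_curve(y_te, proba)

# e.g. the lowest threshold that still reaches 90% precision (if any).
# Note: the last precision/recall pair has no corresponding threshold.
ok = precision[:-1] >= 0.90
chosen = thresholds[ok][0] if ok.any() else None
print(chosen)
```

Lowering `chosen` buys recall at the cost of precision; raising it does the opposite.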
X_train, x_test, y_train, y_test = train_test_split(X_temp, y, test_size=0.33, stratify=y)
sel = SelectFromModel(RandomForestClassifier(n_estimators = 500))
sel.fit(X_train, y_train)
SelectFromModel(estimator=RandomForestClassifier(n_estimators=500))
top_features = X_train.columns[np.array(sel.estimator_.feature_importances_).argsort()[::-1][:15]]
X_train_temp = X_train.loc[:, top_features]
x_test_temp = x_test.loc[:, top_features]
sc = StandardScaler()
X_train_temp = sc.fit_transform(X_train_temp)
rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train_temp,y_train)
RandomForestClassifier(n_estimators=500)
# transform (not fit_transform): scale the test set with the training statistics
x_test_temp = sc.transform(x_test_temp)
y_hat = rf.predict(x_test_temp)
y_hat
array([0, 0, 0, ..., 0, 0, 1], dtype=int64)
print(classification_report(y_test,y_hat,target_names=['Class 0','Class 1']))
precision recall f1-score support
Class 0 0.95 1.00 0.98 35883
Class 1 0.99 0.53 0.69 3717
accuracy 0.95 39600
macro avg 0.97 0.76 0.83 39600
weighted avg 0.96 0.95 0.95 39600
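Fitting the scaler on train and test separately risks leakage; wrapping the scaler and the model in a `Pipeline` guarantees the test set is always transformed with the training statistics. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# fit() fits the scaler on training data only; score()/predict() reuse
# the already-fitted transform on new data.
pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(n_estimators=50, random_state=0))
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print(round(acc, 3))
```

(For tree ensembles like Random Forest, scaling makes little difference anyway, which matches the nearly identical report above.)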
X_train, x_test, y_train, y_test = train_test_split(X_temp, y, test_size=0.33, stratify=y)
X_train_temp = X_train.loc[:,X_train.columns[np.array(sel.estimator_.feature_importances_).argsort()[::-1][:15]]]
x_test_temp = x_test.loc[:,X_train.columns[np.array(sel.estimator_.feature_importances_).argsort()[::-1][:15]]]
import optuna
import sklearn
def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 10, 1000)
    max_depth = trial.suggest_int('max_depth', 1, 1000)
    max_features = trial.suggest_categorical('max_features', ['auto', 'sqrt', 'log2'])
    # max_leaf_nodes must be greater than 1, so sample from 2 upwards
    max_leaf_nodes = trial.suggest_int('max_leaf_nodes', 2, 100)
    clf = sklearn.ensemble.RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                                  max_features=max_features, max_leaf_nodes=max_leaf_nodes,
                                                  class_weight='balanced')
    return sklearn.model_selection.cross_val_score(clf, X_train_temp, y_train, n_jobs=-1).mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
[I 2022-03-25 11:18:08,696] A new study created in memory with name: no-name-bf3de04c-d5c7-481d-ad3b-ab0c42862e8e
[I 2022-03-25 11:18:17,643] Trial 0 finished with value: 0.9477611940298507 and parameters: {'n_estimators': 251, 'max_depth': 203, 'max_features': 'sqrt', 'max_leaf_nodes': 57}. Best is trial 0 with value: 0.9477611940298507.
...
[I 2022-03-25 11:24:48,948] Trial 28 finished with value: 0.9499502487562189 and parameters: {'n_estimators': 219, 'max_depth': 690, 'max_features': 'auto', 'max_leaf_nodes': 100}. Best is trial 28 with value: 0.9499502487562189.
...
[I 2022-03-25 11:25:39,658] Trial 35 finished with value: 0.9499626865671642 and parameters: {'n_estimators': 104, 'max_depth': 798, 'max_features': 'log2', 'max_leaf_nodes': 84}. Best is trial 35 with value: 0.9499626865671642.
...
[I 2022-03-25 11:34:48,561] Trial 91 finished with value: 0.952549751243781 and parameters: {'n_estimators': 385, 'max_depth': 8, 'max_features': 'log2', 'max_leaf_nodes': 97}. Best is trial 91 with value: 0.952549751243781.
...
[I 2022-03-25 11:36:20,559] Trial 96 finished with value: 0.9497636815920398 and parameters: {'n_estimators': 511, 'max_depth': 378, 'max_features': 'auto', 'max_leaf_nodes': 98}.
Best is trial 91 with value: 0.952549751243781. [I 2022-03-25 11:36:40,795] Trial 97 finished with value: 0.94931592039801 and parameters: {'n_estimators': 525, 'max_depth': 315, 'max_features': 'auto', 'max_leaf_nodes': 97}. Best is trial 91 with value: 0.952549751243781. [I 2022-03-25 11:37:00,296] Trial 98 finished with value: 0.9489925373134328 and parameters: {'n_estimators': 492, 'max_depth': 381, 'max_features': 'auto', 'max_leaf_nodes': 100}. Best is trial 91 with value: 0.952549751243781. [I 2022-03-25 11:37:18,487] Trial 99 finished with value: 0.9488930348258707 and parameters: {'n_estimators': 427, 'max_depth': 277, 'max_features': 'auto', 'max_leaf_nodes': 92}. Best is trial 91 with value: 0.952549751243781.
trial = study.best_trial
print('Accuracy: {}'.format(trial.value))
Accuracy: 0.952549751243781
print("Best hyperparameters: {}".format(trial.params))
Best hyperparameters: {'n_estimators': 385, 'max_depth': 8, 'max_features': 'log2', 'max_leaf_nodes': 97}
optuna.visualization.plot_slice(study)
optuna.visualization.plot_optimization_history(study)
optimised_rf = RandomForestClassifier(max_depth=study.best_params['max_depth'],
                                      max_features=study.best_params['max_features'],
                                      max_leaf_nodes=study.best_params['max_leaf_nodes'],
                                      n_estimators=study.best_params['n_estimators'],
                                      class_weight='balanced',
                                      n_jobs=-1)
optimised_rf.fit(X_train_temp,y_train)
RandomForestClassifier(class_weight='balanced', max_depth=8, max_features='log2',
                       max_leaf_nodes=97, n_estimators=385, n_jobs=-1)
y_hat = optimised_rf.predict(x_test_temp)
y_hat
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
auprc = average_precision_score(y_test,y_hat)
print("Area Under Precision Recall Curve:",auprc)
Area Under Precision Recall Curve: 0.5228594900166278
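Note that `average_precision_score` above is fed the hard 0/1 predictions from `predict`, which collapses the precision-recall curve to a single operating point. Scoring the positive-class probabilities from `predict_proba` uses the full ranking instead. A minimal sketch on a synthetic imbalanced dataset (not the fraud data in this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% positives, standing in for the fraud frame.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                             random_state=42).fit(X_tr, y_tr)

# AP on hard labels reflects one threshold; AP on scores uses the whole ranking.
ap_labels = average_precision_score(y_te, clf.predict(X_te))
ap_scores = average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
print(ap_labels, ap_scores)
```

In this notebook that would mean passing `optimised_rf.predict_proba(x_test_temp)[:, 1]` to `average_precision_score` instead of `y_hat`.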
print(classification_report(y_test,y_hat,target_names=['Class 0','Class 1']))
              precision    recall  f1-score   support

     Class 0       0.95      0.99      0.97     35883
     Class 1       0.89      0.54      0.67      3717

    accuracy                           0.95     39600
   macro avg       0.92      0.77      0.82     39600
weighted avg       0.95      0.95      0.94     39600
from sklearn.metrics import recall_score, precision_score
recall_score(y_test,y_hat)
0.5415657788539144
kf = sklearn.model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
score = sklearn.model_selection.cross_val_score(RandomForestClassifier(max_depth=study.best_params['max_depth'],
                                                                       max_features=study.best_params['max_features'],
                                                                       max_leaf_nodes=study.best_params['max_leaf_nodes'],
                                                                       n_estimators=study.best_params['n_estimators'],
                                                                       class_weight='balanced',
                                                                       n_jobs=1),
                                                X_train_temp, y_train, cv=kf, scoring="recall")
print(f'Scores for each fold are: {score}')
print(f'Average score: {score.mean():.2f}')
Scores for each fold are: [0.52617628 0.53015242 0.54768212 0.53377483 0.54701987]
Average score: 0.54
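`cross_val_score` handles one metric per call; `cross_validate` can report recall and average precision in a single pass over the same folds. A sketch on synthetic data (the frames and estimator settings here are stand-ins for the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for X_train_temp / y_train.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Score recall and average precision together, fold by fold.
res = cross_validate(
    RandomForestClassifier(n_estimators=100, class_weight='balanced',
                           n_jobs=-1, random_state=42),
    X, y, cv=cv, scoring=['recall', 'average_precision'])

print(res['test_recall'].mean(), res['test_average_precision'].mean())
```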
X_train, x_test, y_train, y_test = train_test_split(X_temp, y, test_size=0.33, stratify=y)
top15_cols = X_train.columns[np.array(sel.estimator_.feature_importances_).argsort()[::-1][:15]]
X_train_temp = X_train.loc[:, top15_cols]
x_test_temp = x_test.loc[:, top15_cols]
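The lines above take the column indices of the 15 largest feature importances from the fitted selector and apply the same column subset to both the train and test frames. A self-contained sketch of that top-k pattern on a synthetic frame (here `sel.estimator_` is replaced by a freshly fitted forest, and `k` is set to 5):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 20)),
                 columns=[f'f{i}' for i in range(20)])
y = (X['f0'] + X['f1'] > 0).astype(int)   # only f0 and f1 carry signal

# Stand-in for sel.estimator_: any fitted model exposing feature_importances_.
est = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

k = 5
top_k_cols = X.columns[np.argsort(est.feature_importances_)[::-1][:k]]
X_top = X.loc[:, top_k_cols]   # the same columns must be taken from the test frame
print(list(top_k_cols))
```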
import optuna
import sklearn.ensemble
def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 10, 1000)
    max_depth = trial.suggest_int('max_depth', 1, 1000)
    max_features = trial.suggest_categorical('max_features', ['auto', 'sqrt', 'log2'])
    max_leaf_nodes = trial.suggest_int('max_leaf_nodes', 2, 100)
    clf = sklearn.ensemble.RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                                  max_features=max_features, max_leaf_nodes=max_leaf_nodes,
                                                  class_weight='balanced')
    clf.fit(X_train_temp, y_train)
    pred = clf.predict(x_test_temp)
    prauc = average_precision_score(y_test, pred)
    return prauc
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
[I 2022-03-25 18:44:15,681] A new study created in memory with name: no-name-5df87774-aa74-4e44-bf1d-dd9b459e5544 [I 2022-03-25 18:44:48,766] Trial 0 finished with value: 0.5061714258692438 and parameters: {'n_estimators': 874, 'max_depth': 151, 'max_features': 'sqrt', 'max_leaf_nodes': 68}. Best is trial 0 with value: 0.5061714258692438. [I 2022-03-25 18:45:04,629] Trial 1 finished with value: 0.5075793790972429 and parameters: {'n_estimators': 364, 'max_depth': 468, 'max_features': 'auto', 'max_leaf_nodes': 95}. Best is trial 1 with value: 0.5075793790972429. [I 2022-03-25 18:45:11,727] Trial 2 finished with value: 0.48084601301083696 and parameters: {'n_estimators': 167, 'max_depth': 551, 'max_features': 'log2', 'max_leaf_nodes': 81}. Best is trial 1 with value: 0.5075793790972429. [I 2022-03-25 18:45:22,597] Trial 3 finished with value: 0.4832262762923327 and parameters: {'n_estimators': 267, 'max_depth': 414, 'max_features': 'log2', 'max_leaf_nodes': 69}. Best is trial 1 with value: 0.5075793790972429. [I 2022-03-25 18:45:45,048] Trial 4 finished with value: 0.4997747938747731 and parameters: {'n_estimators': 542, 'max_depth': 929, 'max_features': 'auto', 'max_leaf_nodes': 83}. Best is trial 1 with value: 0.5075793790972429. [I 2022-03-25 18:45:56,132] Trial 5 finished with value: 0.46367233246191986 and parameters: {'n_estimators': 495, 'max_depth': 590, 'max_features': 'auto', 'max_leaf_nodes': 4}. Best is trial 1 with value: 0.5075793790972429. [I 2022-03-25 18:46:22,048] Trial 6 finished with value: 0.5224287117097042 and parameters: {'n_estimators': 683, 'max_depth': 887, 'max_features': 'log2', 'max_leaf_nodes': 56}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:46:39,744] Trial 7 finished with value: 0.5059319588133147 and parameters: {'n_estimators': 453, 'max_depth': 827, 'max_features': 'auto', 'max_leaf_nodes': 70}. Best is trial 6 with value: 0.5224287117097042. 
[I 2022-03-25 18:46:53,363] Trial 8 finished with value: 0.5105789188953082 and parameters: {'n_estimators': 335, 'max_depth': 691, 'max_features': 'log2', 'max_leaf_nodes': 86}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:47:27,407] Trial 9 finished with value: 0.5054187878996955 and parameters: {'n_estimators': 896, 'max_depth': 372, 'max_features': 'log2', 'max_leaf_nodes': 57}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:47:51,482] Trial 10 finished with value: 0.4869622075494896 and parameters: {'n_estimators': 718, 'max_depth': 973, 'max_features': 'sqrt', 'max_leaf_nodes': 31}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:48:14,266] Trial 11 finished with value: 0.49471978739255795 and parameters: {'n_estimators': 687, 'max_depth': 749, 'max_features': 'log2', 'max_leaf_nodes': 32}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:48:18,607] Trial 12 finished with value: 0.4896902360371527 and parameters: {'n_estimators': 121, 'max_depth': 708, 'max_features': 'log2', 'max_leaf_nodes': 40}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:48:19,090] Trial 13 finished with value: 0.46396858438499533 and parameters: {'n_estimators': 12, 'max_depth': 669, 'max_features': 'log2', 'max_leaf_nodes': 98}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:48:44,248] Trial 14 finished with value: 0.5015748259840237 and parameters: {'n_estimators': 667, 'max_depth': 849, 'max_features': 'log2', 'max_leaf_nodes': 50}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:49:01,031] Trial 15 finished with value: 0.46800434091782644 and parameters: {'n_estimators': 598, 'max_depth': 260, 'max_features': 'log2', 'max_leaf_nodes': 12}. Best is trial 6 with value: 0.5224287117097042. 
[I 2022-03-25 18:49:15,833] Trial 16 finished with value: 0.5134107226723005 and parameters: {'n_estimators': 341, 'max_depth': 845, 'max_features': 'sqrt', 'max_leaf_nodes': 84}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:49:52,402] Trial 17 finished with value: 0.50168940457269 and parameters: {'n_estimators': 981, 'max_depth': 993, 'max_features': 'sqrt', 'max_leaf_nodes': 51}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:50:16,207] Trial 18 finished with value: 0.471389737450093 and parameters: {'n_estimators': 762, 'max_depth': 837, 'max_features': 'sqrt', 'max_leaf_nodes': 21}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:50:32,803] Trial 19 finished with value: 0.5026933503723428 and parameters: {'n_estimators': 430, 'max_depth': 70, 'max_features': 'sqrt', 'max_leaf_nodes': 61}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:51:02,072] Trial 20 finished with value: 0.5042001266776774 and parameters: {'n_estimators': 808, 'max_depth': 605, 'max_features': 'sqrt', 'max_leaf_nodes': 43}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:51:16,601] Trial 21 finished with value: 0.5000922021451794 and parameters: {'n_estimators': 345, 'max_depth': 767, 'max_features': 'log2', 'max_leaf_nodes': 85}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:51:29,584] Trial 22 finished with value: 0.5041902782920183 and parameters: {'n_estimators': 308, 'max_depth': 885, 'max_features': 'log2', 'max_leaf_nodes': 88}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:51:37,070] Trial 23 finished with value: 0.47197102436174054 and parameters: {'n_estimators': 189, 'max_depth': 670, 'max_features': 'sqrt', 'max_leaf_nodes': 75}. Best is trial 6 with value: 0.5224287117097042. 
[I 2022-03-25 18:52:03,106] Trial 24 finished with value: 0.4750128011518962 and parameters: {'n_estimators': 608, 'max_depth': 747, 'max_features': 'log2', 'max_leaf_nodes': 100}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:52:13,388] Trial 25 finished with value: 0.4981866173964878 and parameters: {'n_estimators': 246, 'max_depth': 911, 'max_features': 'sqrt', 'max_leaf_nodes': 79}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:52:29,662] Trial 26 finished with value: 0.4949574103683563 and parameters: {'n_estimators': 414, 'max_depth': 800, 'max_features': 'log2', 'max_leaf_nodes': 63}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:52:51,873] Trial 27 finished with value: 0.495756839808449 and parameters: {'n_estimators': 523, 'max_depth': 655, 'max_features': 'sqrt', 'max_leaf_nodes': 92}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:52:55,916] Trial 28 finished with value: 0.5199221725247709 and parameters: {'n_estimators': 92, 'max_depth': 520, 'max_features': 'auto', 'max_leaf_nodes': 88}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:52:57,238] Trial 29 finished with value: 0.4691930590294736 and parameters: {'n_estimators': 30, 'max_depth': 330, 'max_features': 'auto', 'max_leaf_nodes': 76}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:53:00,055] Trial 30 finished with value: 0.46693686124803424 and parameters: {'n_estimators': 66, 'max_depth': 256, 'max_features': 'auto', 'max_leaf_nodes': 72}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:53:09,975] Trial 31 finished with value: 0.5138073859803778 and parameters: {'n_estimators': 229, 'max_depth': 526, 'max_features': 'auto', 'max_leaf_nodes': 91}. Best is trial 6 with value: 0.5224287117097042. 
[I 2022-03-25 18:53:19,182] Trial 32 finished with value: 0.49534907024051844 and parameters: {'n_estimators': 214, 'max_depth': 490, 'max_features': 'auto', 'max_leaf_nodes': 92}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:53:23,809] Trial 33 finished with value: 0.4836521126170037 and parameters: {'n_estimators': 116, 'max_depth': 519, 'max_features': 'auto', 'max_leaf_nodes': 63}. Best is trial 6 with value: 0.5224287117097042. [I 2022-03-25 18:53:29,926] Trial 34 finished with value: 0.5317973740007639 and parameters: {'n_estimators': 141, 'max_depth': 453, 'max_features': 'auto', 'max_leaf_nodes': 94}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:53:35,672] Trial 35 finished with value: 0.5201634287804467 and parameters: {'n_estimators': 137, 'max_depth': 441, 'max_features': 'auto', 'max_leaf_nodes': 93}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:53:41,096] Trial 36 finished with value: 0.4770899398391231 and parameters: {'n_estimators': 124, 'max_depth': 405, 'max_features': 'auto', 'max_leaf_nodes': 99}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:53:44,521] Trial 37 finished with value: 0.4602250211392937 and parameters: {'n_estimators': 80, 'max_depth': 435, 'max_features': 'auto', 'max_leaf_nodes': 78}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:53:51,454] Trial 38 finished with value: 0.5035551749185121 and parameters: {'n_estimators': 161, 'max_depth': 295, 'max_features': 'auto', 'max_leaf_nodes': 95}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:54:03,045] Trial 39 finished with value: 0.5150599823904909 and parameters: {'n_estimators': 285, 'max_depth': 182, 'max_features': 'auto', 'max_leaf_nodes': 67}. Best is trial 34 with value: 0.5317973740007639. 
[I 2022-03-25 18:54:10,279] Trial 40 finished with value: 0.47075593994137027 and parameters: {'n_estimators': 172, 'max_depth': 596, 'max_features': 'auto', 'max_leaf_nodes': 82}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:54:12,746] Trial 41 finished with value: 0.49838312656226186 and parameters: {'n_estimators': 62, 'max_depth': 183, 'max_features': 'auto', 'max_leaf_nodes': 68}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:54:23,238] Trial 42 finished with value: 0.5050431161222122 and parameters: {'n_estimators': 248, 'max_depth': 150, 'max_features': 'auto', 'max_leaf_nodes': 89}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:54:34,875] Trial 43 finished with value: 0.49854248116331135 and parameters: {'n_estimators': 300, 'max_depth': 368, 'max_features': 'auto', 'max_leaf_nodes': 56}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:54:39,396] Trial 44 finished with value: 0.5148193507693588 and parameters: {'n_estimators': 121, 'max_depth': 91, 'max_features': 'auto', 'max_leaf_nodes': 47}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:54:55,904] Trial 45 finished with value: 0.4836218214774022 and parameters: {'n_estimators': 386, 'max_depth': 9, 'max_features': 'auto', 'max_leaf_nodes': 69}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:55:08,216] Trial 46 finished with value: 0.5195510580183286 and parameters: {'n_estimators': 281, 'max_depth': 567, 'max_features': 'auto', 'max_leaf_nodes': 95}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:55:29,687] Trial 47 finished with value: 0.5245750300688934 and parameters: {'n_estimators': 492, 'max_depth': 454, 'max_features': 'auto', 'max_leaf_nodes': 95}. Best is trial 34 with value: 0.5317973740007639. 
[I 2022-03-25 18:56:07,235] Trial 48 finished with value: 0.5095927140172948 and parameters: {'n_estimators': 855, 'max_depth': 453, 'max_features': 'auto', 'max_leaf_nodes': 96}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:56:27,392] Trial 49 finished with value: 0.5093527400844952 and parameters: {'n_estimators': 467, 'max_depth': 402, 'max_features': 'auto', 'max_leaf_nodes': 85}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:56:49,599] Trial 50 finished with value: 0.48959558857702007 and parameters: {'n_estimators': 634, 'max_depth': 353, 'max_features': 'auto', 'max_leaf_nodes': 32}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:57:14,065] Trial 51 finished with value: 0.5066871845854746 and parameters: {'n_estimators': 558, 'max_depth': 561, 'max_features': 'auto', 'max_leaf_nodes': 95}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:57:46,453] Trial 52 finished with value: 0.5024168862159986 and parameters: {'n_estimators': 741, 'max_depth': 464, 'max_features': 'auto', 'max_leaf_nodes': 92}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:57:53,030] Trial 53 finished with value: 0.49368167396819357 and parameters: {'n_estimators': 147, 'max_depth': 552, 'max_features': 'auto', 'max_leaf_nodes': 100}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:58:01,700] Trial 54 finished with value: 0.5270487707910801 and parameters: {'n_estimators': 205, 'max_depth': 491, 'max_features': 'auto', 'max_leaf_nodes': 88}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:58:02,307] Trial 55 finished with value: 0.4521799691720185 and parameters: {'n_estimators': 13, 'max_depth': 488, 'max_features': 'log2', 'max_leaf_nodes': 88}. Best is trial 34 with value: 0.5317973740007639. 
[I 2022-03-25 18:58:05,848] Trial 56 finished with value: 0.4732512639443472 and parameters: {'n_estimators': 85, 'max_depth': 630, 'max_features': 'auto', 'max_leaf_nodes': 81}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:58:11,695] Trial 57 finished with value: 0.4917403529294589 and parameters: {'n_estimators': 198, 'max_depth': 314, 'max_features': 'log2', 'max_leaf_nodes': 16}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:58:41,549] Trial 58 finished with value: 0.5264048228433643 and parameters: {'n_estimators': 693, 'max_depth': 407, 'max_features': 'auto', 'max_leaf_nodes': 87}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:59:05,526] Trial 59 finished with value: 0.4866273170572835 and parameters: {'n_estimators': 679, 'max_depth': 428, 'max_features': 'log2', 'max_leaf_nodes': 36}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 18:59:39,932] Trial 60 finished with value: 0.5215255000037149 and parameters: {'n_estimators': 813, 'max_depth': 400, 'max_features': 'auto', 'max_leaf_nodes': 82}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:00:13,023] Trial 61 finished with value: 0.5150069643051489 and parameters: {'n_estimators': 796, 'max_depth': 384, 'max_features': 'auto', 'max_leaf_nodes': 82}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:00:52,421] Trial 62 finished with value: 0.5066538788088425 and parameters: {'n_estimators': 920, 'max_depth': 472, 'max_features': 'auto', 'max_leaf_nodes': 86}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:01:27,067] Trial 63 finished with value: 0.5029478193578334 and parameters: {'n_estimators': 840, 'max_depth': 272, 'max_features': 'auto', 'max_leaf_nodes': 72}. Best is trial 34 with value: 0.5317973740007639. 
[I 2022-03-25 19:01:58,022] Trial 64 finished with value: 0.5112397152983802 and parameters: {'n_estimators': 711, 'max_depth': 345, 'max_features': 'auto', 'max_leaf_nodes': 97}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:02:37,772] Trial 65 finished with value: 0.5107308666908434 and parameters: {'n_estimators': 915, 'max_depth': 427, 'max_features': 'auto', 'max_leaf_nodes': 92}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:02:48,180] Trial 66 finished with value: 0.31162526350770664 and parameters: {'n_estimators': 565, 'max_depth': 513, 'max_features': 'log2', 'max_leaf_nodes': 2}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:03:20,275] Trial 67 finished with value: 0.5004512577817662 and parameters: {'n_estimators': 777, 'max_depth': 450, 'max_features': 'auto', 'max_leaf_nodes': 76}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:03:47,623] Trial 68 finished with value: 0.5090305462936562 and parameters: {'n_estimators': 650, 'max_depth': 391, 'max_features': 'auto', 'max_leaf_nodes': 90}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:04:28,490] Trial 69 finished with value: 0.5093839657962431 and parameters: {'n_estimators': 974, 'max_depth': 497, 'max_features': 'log2', 'max_leaf_nodes': 85}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:04:48,787] Trial 70 finished with value: 0.5203829692681539 and parameters: {'n_estimators': 484, 'max_depth': 305, 'max_features': 'sqrt', 'max_leaf_nodes': 79}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:05:11,213] Trial 71 finished with value: 0.505453036217256 and parameters: {'n_estimators': 584, 'max_depth': 246, 'max_features': 'sqrt', 'max_leaf_nodes': 56}. Best is trial 34 with value: 0.5317973740007639. 
[I 2022-03-25 19:05:31,678] Trial 72 finished with value: 0.50912685717311 and parameters: {'n_estimators': 491, 'max_depth': 316, 'max_features': 'sqrt', 'max_leaf_nodes': 79}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:05:54,399] Trial 73 finished with value: 0.5274706720993672 and parameters: {'n_estimators': 527, 'max_depth': 366, 'max_features': 'sqrt', 'max_leaf_nodes': 94}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:06:16,159] Trial 74 finished with value: 0.502782770066728 and parameters: {'n_estimators': 508, 'max_depth': 234, 'max_features': 'sqrt', 'max_leaf_nodes': 86}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:06:31,246] Trial 75 finished with value: 0.46897085919255466 and parameters: {'n_estimators': 381, 'max_depth': 368, 'max_features': 'sqrt', 'max_leaf_nodes': 73}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:07:00,885] Trial 76 finished with value: 0.4896902360371527 and parameters: {'n_estimators': 712, 'max_depth': 411, 'max_features': 'sqrt', 'max_leaf_nodes': 80}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:07:21,097] Trial 77 finished with value: 0.49097542393097315 and parameters: {'n_estimators': 622, 'max_depth': 220, 'max_features': 'sqrt', 'max_leaf_nodes': 24}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:07:57,220] Trial 78 finished with value: 0.5020658939579392 and parameters: {'n_estimators': 824, 'max_depth': 315, 'max_features': 'sqrt', 'max_leaf_nodes': 98}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:08:15,822] Trial 79 finished with value: 0.4971584716185769 and parameters: {'n_estimators': 434, 'max_depth': 290, 'max_features': 'sqrt', 'max_leaf_nodes': 88}. Best is trial 34 with value: 0.5317973740007639. 
[I 2022-03-25 19:08:46,834] Trial 80 finished with value: 0.4950860637055101 and parameters: {'n_estimators': 747, 'max_depth': 346, 'max_features': 'sqrt', 'max_leaf_nodes': 83}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:09:10,004] Trial 81 finished with value: 0.5098326913666125 and parameters: {'n_estimators': 526, 'max_depth': 444, 'max_features': 'auto', 'max_leaf_nodes': 93}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:09:30,258] Trial 82 finished with value: 0.4954337896311273 and parameters: {'n_estimators': 460, 'max_depth': 484, 'max_features': 'auto', 'max_leaf_nodes': 95}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:09:43,939] Trial 83 finished with value: 0.4864895401588491 and parameters: {'n_estimators': 322, 'max_depth': 534, 'max_features': 'auto', 'max_leaf_nodes': 89}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:10:09,395] Trial 84 finished with value: 0.4792455601170952 and parameters: {'n_estimators': 587, 'max_depth': 389, 'max_features': 'sqrt', 'max_leaf_nodes': 94}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:10:47,826] Trial 85 finished with value: 0.5139651048929613 and parameters: {'n_estimators': 879, 'max_depth': 411, 'max_features': 'log2', 'max_leaf_nodes': 98}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:11:05,239] Trial 86 finished with value: 0.5019391212484098 and parameters: {'n_estimators': 401, 'max_depth': 965, 'max_features': 'auto', 'max_leaf_nodes': 90}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:11:30,246] Trial 87 finished with value: 0.4992577047839486 and parameters: {'n_estimators': 659, 'max_depth': 577, 'max_features': 'auto', 'max_leaf_nodes': 46}. Best is trial 34 with value: 0.5317973740007639. 
[I 2022-03-25 19:11:51,350] Trial 88 finished with value: 0.4729593225547834 and parameters: {'n_estimators': 487, 'max_depth': 362, 'max_features': 'auto', 'max_leaf_nodes': 100}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:11:53,691] Trial 89 finished with value: 0.4810814315874636 and parameters: {'n_estimators': 50, 'max_depth': 464, 'max_features': 'sqrt', 'max_leaf_nodes': 84}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:11:59,789] Trial 90 finished with value: 0.4711024024401748 and parameters: {'n_estimators': 148, 'max_depth': 621, 'max_features': 'auto', 'max_leaf_nodes': 77}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:12:03,606] Trial 91 finished with value: 0.5311042702746539 and parameters: {'n_estimators': 93, 'max_depth': 701, 'max_features': 'auto', 'max_leaf_nodes': 87}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:12:08,499] Trial 92 finished with value: 0.46732197535144887 and parameters: {'n_estimators': 115, 'max_depth': 742, 'max_features': 'auto', 'max_leaf_nodes': 87}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:12:16,462] Trial 93 finished with value: 0.5053869291504746 and parameters: {'n_estimators': 187, 'max_depth': 890, 'max_features': 'auto', 'max_leaf_nodes': 91}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:12:20,812] Trial 94 finished with value: 0.5173519386374887 and parameters: {'n_estimators': 101, 'max_depth': 821, 'max_features': 'auto', 'max_leaf_nodes': 93}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:12:22,847] Trial 95 finished with value: 0.48640195589448165 and parameters: {'n_estimators': 48, 'max_depth': 939, 'max_features': 'auto', 'max_leaf_nodes': 83}. Best is trial 34 with value: 0.5317973740007639. 
[I 2022-03-25 19:12:31,514] Trial 96 finished with value: 0.49798887351550786 and parameters: {'n_estimators': 221, 'max_depth': 704, 'max_features': 'log2', 'max_leaf_nodes': 65}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:12:41,865] Trial 97 finished with value: 0.5106090243481592 and parameters: {'n_estimators': 258, 'max_depth': 429, 'max_features': 'auto', 'max_leaf_nodes': 74}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:13:05,755] Trial 98 finished with value: 0.5110592149146701 and parameters: {'n_estimators': 538, 'max_depth': 544, 'max_features': 'auto', 'max_leaf_nodes': 96}. Best is trial 34 with value: 0.5317973740007639. [I 2022-03-25 19:13:20,281] Trial 99 finished with value: 0.5103689029112757 and parameters: {'n_estimators': 362, 'max_depth': 790, 'max_features': 'auto', 'max_leaf_nodes': 81}. Best is trial 34 with value: 0.5317973740007639.
trial = study.best_trial
print('PR AUC Value: {}'.format(trial.value))
PR AUC Value: 0.5317973740007639
print("Best hyperparameters: {}".format(trial.params))
Best hyperparameters: {'n_estimators': 141, 'max_depth': 453, 'max_features': 'auto', 'max_leaf_nodes': 94}
optuna.visualization.plot_slice(study)
optuna.visualization.plot_optimization_history(study)
kf = sklearn.model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
score = sklearn.model_selection.cross_val_score(RandomForestClassifier(max_depth=study.best_params['max_depth'],
                                                                       max_features=study.best_params['max_features'],
                                                                       max_leaf_nodes=study.best_params['max_leaf_nodes'],
                                                                       n_estimators=study.best_params['n_estimators'],
                                                                       class_weight='balanced',
                                                                       n_jobs=1),
                                                X_train_temp, y_train, cv=kf, scoring="recall")
print(f'Scores for each fold are: {score}')
print(f'Average Recall: {score.mean():.2f}')
Scores for each fold are: [0.53736089 0.54809221 0.54531002]
Average Recall: 0.54
optimised_rf = RandomForestClassifier(max_depth=study.best_params['max_depth'],
                                      max_features=study.best_params['max_features'],
                                      max_leaf_nodes=study.best_params['max_leaf_nodes'],
                                      n_estimators=study.best_params['n_estimators'],
                                      class_weight='balanced',
                                      n_jobs=-1)
optimised_rf.fit(X_train_temp,y_train)
RandomForestClassifier(class_weight='balanced', max_depth=453,
max_leaf_nodes=94, n_estimators=141, n_jobs=-1)
y_hat = optimised_rf.predict(x_test_temp)
y_hat
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
prauc = average_precision_score(y_test,y_hat)
print("Area Under Precision Recall Curve:",prauc)
Area Under Precision Recall Curve: 0.5080392374521192
print(classification_report(y_test,y_hat,target_names=['Class 0','Class 1']))
              precision    recall  f1-score   support

     Class 0       0.95      0.99      0.97     35883
     Class 1       0.88      0.53      0.66      3717

    accuracy                           0.95     39600
   macro avg       0.92      0.76      0.82     39600
weighted avg       0.95      0.95      0.94     39600
Here, the test-set recall for Class 1 (0.53) is close to the average recall from stratified k-fold cross-validation (0.54), which suggests that our model is not overfitting.
# use predicted probabilities rather than hard labels so the curve covers all thresholds
y_scores = optimised_rf.predict_proba(x_test_temp)[:, 1]
precision, recall, threshold = precision_recall_curve(y_test, y_scores)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Tradeoff')
plt.show()
Here, although our recall is quite low, we can lower the decision threshold to increase recall. That would decrease precision, a trade-off we have to weigh.
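One way to act on that trade-off is to pick the largest threshold whose recall still meets a target, reading it directly off `precision_recall_curve`. A sketch on synthetic data (the 0.80 recall target is an arbitrary assumption, not a value from this notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the fraud frame.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                             random_state=1).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, scores)

target = 0.80  # desired minimum recall (assumed for illustration)
# recall has one more entry than thresholds and is non-increasing, so the
# last index where recall[:-1] >= target gives the largest usable threshold.
ok = np.where(recall[:-1] >= target)[0]
thr = thresholds[ok[-1]]
y_pred = (scores >= thr).astype(int)
print(thr, recall_score(y_te, y_pred))
```

Lowering `target` trades recall back for precision; the right setting depends on the relative cost of missed fraud versus false alarms.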
Referred Links:
https://machinelearningmastery.com/feature-selection-with-categorical-data/
https://medium.com/adj2141/credit-card-fraud-detection-using-machine-learning-899af62df3ab
https://mlopshowto.com/detecting-financial-fraud-using-machine-learning-three-ways-of-winning-the-war-against-imbalanced-a03f8815cce9
https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
https://stackoverflow.com/questions/40739152/how-to-use-sklearn-featurehasher
https://analyticsindiamag.com/python-guide-to-precision-recall-tradeoff/
https://glassboxmedicine.com/2019/03/02/measuring-performance-auprc/
https://stats.stackexchange.com/questions/113326/what-is-a-good-auc-for-a-precision-recall-curve
http://qingkaikong.blogspot.com/2016/04/plot-histogram-on-clock.html